I have been working on Machine Learning since my third year in college. During this time, though, the process always involved taking a dataset from Kaggle or some other open-source website. The models and algorithms also lived in a Jupyter Notebook or a Python script and were never deployed to a production website; everything ran on localhost.
While interning at HackerRank and also after starting as a Software Engineer here as a part of the HackerRank Labs team working on a new product, I got a chance to deploy three different ML Models to production, working end-to-end on them. In this blog, I will be sharing my learnings and experience from one of the deployed models.
The tasks involved collecting the data, cleaning, analyzing, and visualizing it, selecting and evaluating the model, and finally deploying it. I had assumed that once the model was ready, deployment and integration with the backend would be just a small part of the work. It was the other way around: model development was a small piece of the process, and the entire deployment pipeline required brainstorming.
The problem statement that we were trying to solve was based on Supervised Learning.
1. Data Collection: Data plays an important role in any ML task. The project that I was working on required data from GitHub. Since the GitHub API has an hourly rate limit, I had to make sure that the limit was not exceeded. To handle this, I added batch processing of the data to the Python script.
2. Data Cleaning, Analysis, and Visualization: The data from GitHub was in raw format, so I used pandas to extract the required variables and handle missing data. Once the variables were extracted, I computed their correlation with the target variable, visualized it using a heat map, and then decided on the final decision variables. Since some fields were strings, One Hot Encoding was used. To remove outliers, I tried three different techniques: Log Transformation, Data Normalization, and IQR. Based on the data we had and the boxplots, IQR seemed to work best.
3. Model Selection, Training & Testing: The problem was a supervised learning problem, so the model was selected on that basis, and the process involved hyperparameter tuning. We used an 80/20 train-test split and a Standard Scaler to standardize the features. The accuracy achieved was good enough for our use case.
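The IQR filtering from step 2 can be sketched roughly as follows. This is not the exact script we ran: it uses plain Python so that it runs without dependencies, and the `commits` column, the sample values, and the conventional `k=1.5` multiplier are assumptions made for illustration. With pandas, the quartiles would come from `Series.quantile` instead.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only the rows whose `column` value lies inside the IQR fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


# Hypothetical sample: one repository has an extreme commit count.
SAMPLE_ROWS = [{"commits": c} for c in (10, 12, 11, 13, 12, 14, 11, 500)]

if __name__ == "__main__":
    kept = drop_outliers(SAMPLE_ROWS, "commits")
    print(f"kept {len(kept)} of {len(SAMPLE_ROWS)} rows")
```

Rows outside the fences are dropped rather than clipped here; whether to drop or cap them depends on how much data you can afford to lose.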
This was all about the model development. The next step was to integrate the model with the backend.
1. Integrating data collection: Added an additional pass to the existing crawler in Golang to fetch the required fields for the model. This way the already present features won’t be affected and the Model data fetching will become independent.
2. Deciding how to Deploy: The model weights and the Standard Scalers were in bin format. Apart from these, training generated other intermediate files containing variables that would be needed during inference. After some research, I initially considered two ways to deploy the model: (1) serving the model through an AWS SageMaker API, with S3 for storing the files, or (2) storing both the model artifacts and the files on S3. Since we already had the model weights in bin format and needed S3 to store the intermediate files anyway, we went ahead with the latter.
3. Making sure the model is updated regularly: Since real-world data keeps changing, it’s important to update the model regularly so that data drift does not degrade its predictions. To tackle this, I deployed a Kubernetes cron job that retrained the model on a weekly basis.
4. Integration with backend: Since the project that we were working on was in Rails, I used Resque workers to enqueue jobs, call the Python scripts every time the API was hit, and store the predicted values in Elasticsearch.
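To make the artifact handling and inference flow from steps 2 and 4 more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: a local directory stands in for S3, `pickle` stands in for whatever produced our bin files, the `stars` and `forks` features are hypothetical, and a toy linear model replaces the real one. It only shows the shape of the flow: load the artifacts saved at training time, scale the incoming fields the same way, and return a prediction as JSON that a worker can store.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical artifacts that a training run would produce.
DEMO_ARTIFACTS = {
    "feature_order": ["stars", "forks"],
    "scaler": {"mean": {"stars": 10.0, "forks": 2.0},
               "scale": {"stars": 5.0, "forks": 1.0}},
    "model": {"weights": {"stars": 0.5, "forks": 2.0}, "bias": 1.0},
}


def save_artifacts(directory, artifacts):
    """Write each artifact to its own .bin file (stand-in for an S3 upload)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(directory / f"{name}.bin", "wb") as fh:
            pickle.dump(obj, fh)


def load_artifacts(directory):
    """Read every .bin file back into a dict keyed by file stem."""
    artifacts = {}
    for path in sorted(Path(directory).glob("*.bin")):
        with open(path, "rb") as fh:
            artifacts[path.stem] = pickle.load(fh)
    return artifacts


def predict(payload_json, artifacts):
    """Scale the incoming features and apply the toy linear model."""
    features = json.loads(payload_json)
    scaler = artifacts["scaler"]
    model = artifacts["model"]
    score = model["bias"]
    for name in artifacts["feature_order"]:
        scaled = (features[name] - scaler["mean"][name]) / scaler["scale"][name]
        score += model["weights"][name] * scaled
    return json.dumps({"prediction": score})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        save_artifacts(tmp, DEMO_ARTIFACTS)
        loaded = load_artifacts(tmp)
    print(predict('{"stars": 30, "forks": 4}', loaded))
```

Keeping the scaler parameters and feature order next to the model weights matters because inference has to apply exactly the same transformation that training did.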
Things learned in the process:
1. ML can’t be measured in Sprints: When I was assigned the task, we created a Jira ticket, and it remained in progress for about three months. After a month, it started to become overwhelming to see the ticket in the same state. With time, I understood that in ML, model development and deployment is an iterative process. It’s about experimenting with different approaches and evaluating different pipelines, and you can’t have concrete results each Sprint.
2. Ownership: Taking ownership of the deployment of the Model is important. When we build a model in a Jupyter Notebook, that is only the first step. It is our responsibility to convert those notebooks into a pipeline that can be used in production and is scalable at the same time, to evaluate different deployment methods to find a suitable one, and to be aware of the relevant software engineering concepts.
3. Monitoring: Soon after our model was deployed, new data came in with some null values that the data cleaning script did not handle. While monitoring the weekly cron, I saw a dip in the accuracy, which helped me catch and handle this edge case. Keeping track of the model accuracy after each training cron in production is essential: it lets you confirm that when the model is trained on new live data, data drift has not crept in and the accuracy achieved earlier has not degraded.
4. Logging during Inference: It is good to have a logging system set up for API requests to the model during inference, so that we can track the time taken to clean the incoming data, fetch the required variables, and finally produce the prediction value. This helps in recognizing the bottlenecks in the pipeline.
5. Documenting: During the initial development of the ML Model, not everyone was aware of the parameters used to develop the model and the process followed to finalize those variables. Some decision variable definitions were interpreted differently by different people. That is when I learned that it’s beneficial to document the meaning of each variable and all the steps, processes, and analyses that led to the final Model. Document everything: the problem statement, decision variables, approach, evaluation metrics, testing scenarios, and deployment plan.
6. Codebase: Having a good understanding of the entire codebase where the model is going to be deployed is valuable. This is to make sure that while adding new data fields for the model or setting up the inference pipeline, we don’t change the format or performance of existing features. Also, it helps in building pipelines that can be scaled easily.
7. Understanding the end user’s problem: After we had developed the model, we realized that even though it was working as expected, a slightly tweaked version would be more beneficial for the end users. Before starting development of the model, it’s worthwhile to discuss with the stakeholders the problem statement that we are trying to solve and what the key result for it will be, and to make sure both align with their definition.
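As an illustration of the inference logging from point 4, here is a minimal sketch of per-stage timing. It is not our production code: the `timed` helper, the `handle_request` function, the `stars` and `forks` fields, and the placeholder prediction are all hypothetical. The idea is simply to wrap each stage (cleaning, feature extraction, prediction) so that its duration is logged and can be compared across requests.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("inference")


@contextmanager
def timed(stage, timings):
    """Record how long the wrapped block takes and log it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[stage] = elapsed_ms
        logger.info("stage=%s elapsed_ms=%.2f", stage, elapsed_ms)


def handle_request(raw):
    """Run the three inference stages and return (prediction, timings)."""
    timings = {}
    with timed("clean", timings):
        # Drop null fields so later stages never see None.
        cleaned = {k: v for k, v in raw.items() if v is not None}
    with timed("features", timings):
        features = [cleaned.get("stars", 0), cleaned.get("forks", 0)]
    with timed("predict", timings):
        prediction = sum(features)  # placeholder for the real model call
    return prediction, timings


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(handle_request({"stars": 3, "forks": None}))
```

Logging the stage name and duration as key-value pairs keeps the lines easy to search and aggregate later, which is what makes bottlenecks visible.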