We talk a lot with the data science teams of industrial companies, and it is striking how unaware they usually are of the gap between a good machine learning model and a production-ready machine learning application.
This is one reason why, according to recent Gartner research, fewer than half of machine learning models are fully deployed.
So what is required to productize machine learning models? Here are our top four suggestions.
1. Model flexibility
The perfect model doesn't exist, but there is a minimum set of real-life circumstances that your model should accommodate without major redesigns: the integration of new inputs, moderate variations in data regimes, changes in the frequencies of inputs and outputs...
Here is what you could do to improve model flexibility:
- Don't use data that wouldn't be available in production when prototyping.
- Make sure your cross-validation procedure is bulletproof, especially if you are working on streaming data.
- It helps if you can combine multiple models with complementary strengths and weaknesses (through stacking or aggregation procedures), instead of staking everything on a single, supposedly badass model.
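For time-ordered or streaming data, one common way to keep cross-validation honest is to make sure every test block comes strictly after the data used for training. The sketch below is a minimal, standard-library illustration of that idea; the function name `forward_chaining_splits` and its parameters are our own assumptions, not part of any library (scikit-learn's `TimeSeriesSplit` offers comparable functionality).

```python
from typing import Iterator, List, Tuple


def forward_chaining_splits(
    n_samples: int, n_folds: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs in time order.

    Each test block starts right after its training block ends, so the
    model is never evaluated on data older than what it was trained on.
    """
    if n_folds < 1 or min_train < 1:
        raise ValueError("n_folds and min_train must be positive")
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        test_end = train_end + test_size
        # Expanding training window, followed by the next unseen block.
        yield list(range(train_end)), list(range(train_end, test_end))


# Illustrative sizes only: 10 time-ordered samples, 3 folds.
splits = list(forward_chaining_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(f"train on 0..{train_idx[-1]}, test on {test_idx}")
```

Unlike shuffled k-fold, this never lets information from the future leak into training, which is closer to what the model will face in production.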
2. Code quality
The code quality of machine learning applications straight out of the data lab is often poor, for two main reasons.
- Model training, validation and testing are exploratory and iterative, i.e. inherently messy. If you don't make conscious quality efforts right from the start, chances are you will end up with unwieldy code.
- Data scientists are not software developers. These are two different jobs, competencies, personalities... It is unrealistic to expect production-ready code from unassisted data scientists.
So what's a scrupulous data scientist to do?
- Notebooks don't scale. Unless you are doing pure exploration, you should always work with a version control repository manager (e.g. GitHub or GitLab).
- Assume that Murphy's Law will apply in production - anything that can go wrong will go wrong. Invest in monitoring (error detection and alerting), and use a logging tool.
- If you are aiming at a production-ready application, is pure open-source really appropriate? Hybrid approaches, combining open-source algorithms and commercial software packaging, may save you a lot of trouble.
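To make the monitoring and logging advice concrete, here is a minimal sketch using Python's standard `logging` module. The logger name `ml_app`, the `send_alert` hook and the `toy_model` function are hypothetical placeholders; a real application would plug in its own model and alerting channel.

```python
import logging
from typing import Callable, List, Optional, Sequence

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("ml_app")  # hypothetical logger name


def send_alert(message: str) -> None:
    """Hypothetical alerting hook; swap in email, chat or a pager."""
    logger.critical("ALERT: %s", message)


def monitored_predict(
    predict: Callable[[Sequence[float]], float],
    features: Sequence[float],
    alert: Callable[[str], None] = send_alert,
) -> Optional[float]:
    """Run one prediction, log the outcome, and alert on failure."""
    try:
        result = predict(features)
    except Exception:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("prediction failed for features=%r", features)
        alert("prediction failure in ml_app")
        return None
    logger.info("prediction ok: %s", result)
    return result


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a real model: the mean of the inputs."""
    return sum(features) / len(features)


alerts: List[str] = []  # collects alerts so the example stays self-contained
ok = monitored_predict(toy_model, [1.0, 2.0, 3.0], alert=alerts.append)
failed = monitored_predict(toy_model, [], alert=alerts.append)  # empty input
```

The second call fails with a division by zero inside `toy_model`; the error is logged with its traceback and an alert is raised, while the application itself keeps running.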
3. Computational performance
Your data lab is warm and cozy, while the real industrial world is nasty and brutish. Run time may come second to accuracy in the lab, but it is critical in production - especially for Continuous Intelligence applications. Countless machine learning projects fail because of poor run time.
Here is what we do on our own projects to make sure we are safe:
- We try to find the right balance between model complexity and accuracy. We will typically test that early in the project on a limited data set, before going for full optimization.
- In the same spirit, we measure feature contribution to model accuracy carefully, and discard the features with a high cost/benefit ratio.
- Parallelization and distribution will help, although they are not the panacea that people sometimes expect. But they are quite easy to implement with tools such as Dask.
- We regularly check that our run time is shorter than our required output frequency - with a comfortable safety margin. For example, if we know that our machine learning application will need to generate outputs every hour, we check that our run time stays under 40 minutes.
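The run time check from the last point can be automated in a few lines. The sketch below uses the hourly output and 40-minute ceiling from the example above; the helper names `timed_run` and `within_budget`, and the `toy_pipeline` workload, are illustrative assumptions rather than an actual implementation.

```python
import time
from typing import Callable, Tuple

OUTPUT_PERIOD_S = 60 * 60    # outputs are due every hour
RUN_TIME_BUDGET_S = 40 * 60  # run time must stay under 40 minutes


def timed_run(job: Callable[[], object]) -> Tuple[object, float]:
    """Run job() once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = job()
    return result, time.perf_counter() - start


def within_budget(elapsed_s: float, budget_s: float = RUN_TIME_BUDGET_S) -> bool:
    """True if the measured run time is strictly under the budget."""
    return elapsed_s < budget_s


def toy_pipeline() -> int:
    """Stand-in for a full scoring run."""
    return sum(i * i for i in range(100_000))


result, elapsed = timed_run(toy_pipeline)
# Share of the output period kept in reserve as a safety margin.
margin = 1 - RUN_TIME_BUDGET_S / OUTPUT_PERIOD_S
print(f"run time {elapsed:.3f}s, within budget: {within_budget(elapsed)}")
print(f"safety margin: {margin:.0%} of the output period")
```

Running such a check on realistic data volumes as part of regular testing makes run time regressions visible before they reach production.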
4. Model connectivity
Once in production, your model will not live in a vacuum, but as a cog in a complex IT machine. An essential objective (and quintessential DevOps task) of software packaging is thus to facilitate data flows to and from your model - especially since model deviations caused by broken data sources are harder to detect than outright application failures.
Here is our CTO's advice to improve model connectivity.
- Favor databases and APIs over simple files for storing and exchanging data.
- If you can't avoid files, use standard and portable formats (e.g. CSV, JSON, HDF). Avoid Excel files (no comment) and Python Pickle files (which may cause Python version issues). Also make sure to define file format and structure precisely, preferably using a tidy data structure. Finally, anticipate how your files will flow in and out of your application.
- Implement extensive data validation in your code based on the operational requirements of your application. Protect your application from invalid inputs by failing as early as possible.
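As an illustration of failing early on invalid inputs, the sketch below validates each incoming record before it gets anywhere near the model. The `SensorReading` record, its field names and the accepted value range are made-up assumptions; the actual rules should come from the operational requirements of your application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Mapping


class InvalidInputError(ValueError):
    """Raised when an input record violates the operational requirements."""


@dataclass(frozen=True)
class SensorReading:  # hypothetical record type
    timestamp: datetime
    sensor_id: str
    value: float


def parse_reading(
    raw: Mapping[str, Any], lo: float = -50.0, hi: float = 150.0
) -> SensorReading:
    """Validate one raw record, raising at the first problem found."""
    missing = {"timestamp", "sensor_id", "value"} - set(raw)
    if missing:
        raise InvalidInputError(f"missing fields: {sorted(missing)}")
    try:
        timestamp = datetime.fromisoformat(raw["timestamp"])
        value = float(raw["value"])
    except (TypeError, ValueError) as exc:
        raise InvalidInputError(f"unparseable field: {exc}") from exc
    # The chained comparison is False for NaN, so NaN is rejected too.
    if not lo <= value <= hi:
        raise InvalidInputError(f"value {value} outside [{lo}, {hi}]")
    sensor_id = raw["sensor_id"]
    if not isinstance(sensor_id, str) or not sensor_id:
        raise InvalidInputError("sensor_id must be a non-empty string")
    return SensorReading(timestamp, sensor_id, value)


# Made-up records, for illustration only.
good = parse_reading(
    {"timestamp": "2024-01-01T12:00:00", "sensor_id": "s1", "value": "21.5"}
)
try:
    parse_reading({"timestamp": "2024-01-01T12:00:00", "sensor_id": "s1"})
    rejected = ""
except InvalidInputError as exc:
    rejected = str(exc)
```

Raising a dedicated exception type at the boundary keeps bad records out of the model and makes broken data sources show up as explicit errors rather than as silent model drift.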
Datapred's Continuous Intelligence engine automates lots of the tasks mentioned above. Contact us for a discussion of how it would apply to your modeling challenge.
You can also check this page for links to interesting third-party resources on building machine learning applications for streaming data.