The Datapred blog

On machine learning, time series and how to use them.

Best practices for bulletproof modeling

Posted by Datapred | Oct 30, 2018 9:03:51 AM

This post presents our standard strategy for building accurate predictive models, beyond respecting basic time series modeling principles. We use commodity purchasing optimization as the running example.

 

Tip 1: Formalize your operational objective and target it directly

 

Machine learning works better when it targets the real industrial objective rather than a proxy.

If you are managing a grain mill, your real operational question is not "What will the price of wheat be in four weeks?". It is probably closer to "How should I plan my wheat purchases over the next four weeks?".

To answer the first question, backtesting a standard L1 or L2 regression error will be fine. But to answer the second question, you must first formalize the relevant cost function over the corresponding period, and then backtest it.

Those are very different machine learning problems, yielding different solutions, and the second solution is operationally superior to the first. Implementing it requires extended talks with business experts, which is one of the reasons why auto-ML doesn’t work for real industrial applications.
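
To make the difference between the two objectives concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the fixed demand of 100 units, the optional storage cost and the two naive purchase plans are all hypothetical, and a real cost function would come out of discussions with business experts.

```python
import numpy as np

def l2_error(predicted, actual):
    """Proxy objective: mean squared error of a price forecast."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((predicted - actual) ** 2))

def purchasing_cost(quantities, prices, storage_cost_per_unit_week=0.0):
    """Operational objective: total cost of a purchase plan over the horizon.

    quantities[t] is the volume bought in week t at prices[t]. Volumes bought
    early are assumed (hypothetically) to be stored until the last week.
    """
    quantities = np.asarray(quantities, dtype=float)
    prices = np.asarray(prices, dtype=float)
    weeks_in_storage = np.arange(len(quantities))[::-1]
    spend = np.sum(quantities * prices)
    storage = storage_cost_per_unit_week * np.sum(quantities * weeks_in_storage)
    return float(spend + storage)

def buy_evenly(past_prices, horizon, demand):
    """Naive plan: spread the demand equally over the horizon."""
    return np.full(horizon, demand / horizon)

def buy_all_now(past_prices, horizon, demand):
    """Naive plan: buy everything in the first week of the horizon."""
    quantities = np.zeros(horizon)
    quantities[0] = demand
    return quantities

def backtest_plan(plan_fn, price_history, horizon=4, demand=100.0):
    """Average cost of plan_fn over rolling windows of the price history.

    At each start date the plan only sees prices strictly before that date.
    """
    price_history = np.asarray(price_history, dtype=float)
    costs = []
    for start in range(1, len(price_history) - horizon + 1):
        past = price_history[:start]
        future = price_history[start:start + horizon]
        quantities = plan_fn(past, horizon, demand)
        if not np.isclose(quantities.sum(), demand):
            raise ValueError("the plan must cover the full demand")
        costs.append(purchasing_cost(quantities, future))
    return float(np.mean(costs))
```

A forecasting model would be scored with `l2_error`, whereas a purchasing strategy would be scored with `backtest_plan`. The two scores can rank candidate solutions differently, which is the point of targeting the operational objective directly.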

 

Tip 2: Compute the relevance of each explanatory variable and model parameter

 

A good modeling approach lets you track the influence of explanatory variables and model parameters (e.g. training window size, prediction horizon) over time.

Your commodity purchasing optimization solution could use a sequential and linear combination of multiple predictive models, where each model is specific to: (i) a group of homogeneous variables (e.g. commodity prices, weather forecasts), and (ii) a structuring parameter value (e.g. training window = 1 day, 1 week, 2 weeks).

The relative weight of each model in the combination thus stands for the influence of the corresponding group of variables or parameter value, with the following benefits:

  • Domain experts can check that your solution matches their operational reality (always good for adoption).
  • If your solution underperforms, understanding why is easier, and iterating on potential remedies faster.
  • It helps you reduce the number of variables and parameters in your solution, which increases its reactivity and robustness and accelerates the modeling cycle.

For example, you could find that a short rolling training window is best for optimizing your cost function, meaning that recent observations are the most relevant for the variables involved.
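
One possible way to build such a sequential linear combination is exponentially weighted aggregation, sketched below in Python. This is an assumption chosen for illustration, not necessarily the method Datapred uses. The function name, the squared loss, the learning rate and the toy data are all hypothetical.

```python
import numpy as np

def sequential_aggregation(expert_predictions, targets, learning_rate=0.5):
    """Combine model forecasts sequentially with exponential weights.

    expert_predictions: array of shape (n_steps, n_models)
    targets: array of shape (n_steps,)
    Returns (combined, weight_history), where weight_history[t] holds the
    weights used to form the combined prediction at step t.
    """
    expert_predictions = np.asarray(expert_predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n_steps, n_models = expert_predictions.shape
    cumulative_loss = np.zeros(n_models)
    combined = np.empty(n_steps)
    weight_history = np.empty((n_steps, n_models))
    for t in range(n_steps):
        # Weights depend only on losses observed before step t (no look-ahead).
        shifted = cumulative_loss - cumulative_loss.min()  # numerical stability
        weights = np.exp(-learning_rate * shifted)
        weights /= weights.sum()
        weight_history[t] = weights
        combined[t] = weights @ expert_predictions[t]
        # Squared loss is used here; an operational cost could replace it.
        cumulative_loss += (expert_predictions[t] - targets[t]) ** 2
    return combined, weight_history

# Toy data: model 0 is close to the target, model 1 is biased.
targets = np.linspace(10.0, 19.0, 10)
expert_predictions = np.column_stack([targets + 0.1, targets + 2.0])
combined, weight_history = sequential_aggregation(expert_predictions, targets)
print(weight_history[0])   # equal weights before any loss is observed
print(weight_history[-1])  # weight has shifted toward the more accurate model
```

Plotting the columns of `weight_history` over time gives the kind of influence display described above: if each model is tied to one group of variables or one training window, its weight reads as the relevance of that group or window.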

Tip 3: Put pressure on your model

 

You know the famous quote about unknown unknowns, the things we don't know we don't know.

Industrial life is full of unknown unknowns. By definition, they are not in your historical data, so the only way to prepare for them is to watch how your model reacts to extreme values or totally new circumstances.

In practice, this means backtesting your model while varying its input variables, parameters and operational costs or constraints.

Datapred data scientists use two types of tests for unknown unknowns:

  • Robustness tests, where they measure model performance in deliberately adverse or plainly wrong conditions.
  • Simulation tests, where they assess model performance under neutral but new conditions.

For example, you could assume that purchase orders based on your commodity purchasing solution are executed very slowly, and check whether model performance holds up (robustness test). You could also enter new values for a key explanatory variable, and ask domain experts whether the corresponding results are realistic (simulation test).
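
The slow-order robustness test can be sketched as follows in Python. This is a minimal illustration under assumptions: the function name, the buy signals and the price series are hypothetical, and a real backtest would use the full operational cost function rather than the average price paid.

```python
import numpy as np

def average_price_with_delay(signals, prices, delay=0):
    """Average price paid when an order decided at step t executes at t + delay.

    signals[t] is True when the (hypothetical) model decides to buy at step t.
    Orders that would execute beyond the end of the price history are dropped.
    Returns (average_price, n_dropped_orders).
    """
    signals = np.asarray(signals, dtype=bool)
    prices = np.asarray(prices, dtype=float)
    decision_steps = np.flatnonzero(signals)
    execution_steps = decision_steps + delay
    executed = execution_steps[execution_steps < len(prices)]
    n_dropped = int(len(execution_steps) - len(executed))
    if len(executed) == 0:
        return float("nan"), n_dropped
    return float(prices[executed].mean()), n_dropped

# Toy data: the model buys at the local price lows (steps 1, 3 and 6).
prices = [10, 9, 12, 8, 11, 13, 9, 14]
signals = [False, True, False, True, False, False, True, False]

# Stress the model by making order execution slower and slower.
for delay in range(3):
    average_price, n_dropped = average_price_with_delay(signals, prices, delay)
    print(delay, round(average_price, 2), n_dropped)
```

If the average price paid degrades sharply as soon as a delay is introduced, the solution depends on instant execution, which is worth knowing before putting it into production.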

***

Datapred Explore is designed for data scientists with projects requiring foolproof time series modeling. Contact us for more information or a discussion of how Datapred could help. You can also check this page for a list of time series modeling resources.

Topics: backtest, modeling, predictive analysis, backtesting, machine learning, time series
