Steps For Building Predictive Models

Dec 13, 2021

The use of predictive analytics (PA) models is on the rise. These models are intended to drive increased sales and provide data to refine marketing strategies. They are becoming a common way to measure, for example, consumers' desire or capacity to purchase and, from there, to calibrate scales and communication campaigns.

The Organization of the PA Model

The usual organizational model for the PA function is to build a complete PA team. An alternative is to use tools that automate specific time-consuming tasks. Examples of such platforms include DataRobot, Alteryx, RapidMiner, and WTW Emblem for non-life insurance pricing. This type of tool aims to reduce the need for specialized PA resources by making models easier to build.

When developing a predictive model, the starting point is to identify the needs of the business, as expressed by the project owner. All other things being equal, these needs will influence the choice of model.

Below is a list of the usual steps to follow for any PA task:

Integrate the PA initiative into a business context
Define data needs
Clean data
Choose a model and develop it from the available data
Test the model
Launch it

The PA Models: What are the Benefits?

This type of tool is not going to solve business problems or find data. Nor will it replace field knowledge, remove the need for a data dictionary, or solve the "bad data, bad results" problem. On all these points, it is up to the user to do their job.

What the tool does is:

Automate the first pass of data cleaning (missing values, creation of age groups or other bands, keyword clouds to encode text as variables, detection of outliers, etc.),
Build every conceivable model on a sample of the data (PCA, ARMA, generalized linear model, decision tree forest, gradient boosted trees classifier with early stopping [3], etc.),
Test and validate the models on a second sample of data, and rank them according to their predictive capacity.
The tool also adds a panoply of additional features: visible code, technical documentation in MS Word, graphical representation of the predictive capacity of each variable under each model (with dependency analysis), and many more not listed here.
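
As a rough illustration of what such an automated first cleaning pass might involve, here is a minimal Python sketch using only the standard library. The records, field names, age bands, and the choice of median imputation and Tukey's IQR rule are all assumptions made for the example; they are not taken from any of the platforms named above, and text encoding is left out.

```python
from statistics import median, quantiles

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 23, "amount": 120.0},
    {"age": None, "amount": 95.0},
    {"age": 41, "amount": 110.0},
    {"age": 67, "amount": 5000.0},
    {"age": 35, "amount": 130.0},
    {"age": 52, "amount": 105.0},
    {"age": 29, "amount": 115.0},
    {"age": 58, "amount": 125.0},
]

def impute_median(rows, field):
    """Replace missing values of `field` with the median of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def age_group(age):
    """Map an age to a coarse band (the band edges are arbitrary here)."""
    if age < 30:
        return "under_30"
    if age < 60:
        return "30_to_59"
    return "60_plus"

def flag_outliers(rows, field, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles([r[field] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**r, field + "_outlier": not (low <= r[field] <= high)} for r in rows]

cleaned = impute_median(records, "age")
cleaned = [{**r, "age_group": age_group(r["age"])} for r in cleaned]
cleaned = flag_outliers(cleaned, "amount")

for row in cleaned:
    print(row)
```

Whether a flagged value is an error or a genuine extreme case is exactly the kind of question the tool cannot answer; that judgment stays with the business.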

From there, it's the business's turn to take control. First, the model must be understood before it can be fully assimilated. Then, its hidden side must be brought to light through the generated documentation, individual scenarios, and other graphics. These materials can be reused to explain the model to different stakeholders and obtain their consent. Specific points must also be finalized, such as the completion of the documentation, before the actual launch.


You have data files from this department, but you may need more data from internal or external sources (e.g., a data broker ...). Make sure you can call on a lawyer to check that transfers of data from one legal entity to another comply with the GDPR and any other legal constraint.

Target value

This means it is essential to define exactly what you need to predict. Two illustrations make the point:

In the first, we notice that we are not looking to qualify individual transactions but groups of transactions that fall within the scope of a single survey.

In the second, we construct an indicator highlighting the transactions covered by the survey and the result of the study (amount recovered net of the cost of the survey). The file can then have an indicator with three positions: "Not surveyed," "Positive recovery," "Negative recovery."
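
A minimal sketch of how such a three-position indicator could be derived is shown below. The function name, its parameters, and the sample figures are hypothetical, and treating a net recovery of exactly zero as "Negative recovery" is an assumption the text does not settle.

```python
def recovery_indicator(surveyed, amount_recovered=0.0, survey_cost=0.0):
    """Return the three-position target for one transaction.

    The net result is the amount recovered minus the cost of the survey.
    Assumption: a net result of exactly zero counts as "Negative recovery".
    """
    if not surveyed:
        return "Not surveyed"
    net = amount_recovered - survey_cost
    return "Positive recovery" if net > 0 else "Negative recovery"

# Hypothetical transactions: (surveyed?, amount recovered, survey cost)
transactions = [
    (False, 0.0, 0.0),
    (True, 1200.0, 300.0),
    (True, 100.0, 300.0),
]

targets = [recovery_indicator(*t) for t in transactions]
print(targets)
```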

Calculation With All Possible Predictive Models, Classification of Models, Additional Elements

The modelling work can now begin. The data should be separated into several subfiles: one for calibration, one for testing, one for comparing models. As a series of tests is carried out, it becomes necessary to apply a more refined segmentation and to verify that data used on one side is not also used on the other.
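
The split and the overlap check can be sketched as follows, using only the Python standard library. The 60/20/20 proportions, the function names, and the fixed seed are assumptions for illustration. If the unit of analysis is a group of transactions, the identifiers passed in could be group identifiers rather than transaction identifiers, so that a whole group lands in a single subfile.

```python
import random

def split_three_ways(row_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle identifiers and cut them into calibration, testing, and comparison subsets."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_cal = int(len(ids) * fractions[0])
    n_test = int(len(ids) * fractions[1])
    calibration = ids[:n_cal]
    testing = ids[n_cal:n_cal + n_test]
    comparison = ids[n_cal + n_test:]  # whatever remains
    return calibration, testing, comparison

def assert_disjoint(*subsets):
    """Raise ValueError if any identifier appears in more than one subset."""
    seen = set()
    for subset in subsets:
        overlap = seen & set(subset)
        if overlap:
            raise ValueError(f"identifiers used in more than one subset: {sorted(overlap)}")
        seen |= set(subset)

calibration, testing, comparison = split_three_ways(range(100))
assert_disjoint(calibration, testing, comparison)
print(len(calibration), len(testing), len(comparison))
```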

Then comes the step of finding the correlated variables, running a PCA, and testing linear models or even more complex ones.
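
The first of these steps, screening for correlated variables, can be sketched with a plain Pearson correlation. The column names, the sample values, and the 0.9 threshold are assumptions for illustration; the PCA and the model fitting are not covered here.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.9):
    """List the pairs of variables whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Hypothetical variables observed on five customers.
columns = {
    "income": [20, 35, 50, 65, 80],
    "spend": [11, 18, 26, 33, 41],
    "visits": [3, 9, 2, 8, 4],
}

print(correlated_pairs(columns))
```

Strongly correlated pairs are candidates for dropping one of the two variables or for combining them before fitting a linear model.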

We must then choose the model that works best after having clarified the classification criteria.
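
To make the idea concrete, here is a minimal sketch that fits two toy candidates on a calibration sample and ranks them on a separate comparison sample. The candidates (a constant mean and a least-squares line), the data, and the choice of mean absolute error as the ranking criterion are all assumptions for illustration; a real project would pick the criterion to match the business need.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the calibration mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate: ordinary least-squares line through the calibration data."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mean_absolute_error(model, xs, ys):
    """Ranking criterion: average absolute gap between prediction and observation."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))

def rank_models(fitters, calibration, comparison):
    """Fit each candidate on calibration data, score it on comparison data, best first."""
    cal_x, cal_y = calibration
    cmp_x, cmp_y = comparison
    scores = {
        name: mean_absolute_error(fit(cal_x, cal_y), cmp_x, cmp_y)
        for name, fit in fitters.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical samples: (inputs, observed values)
calibration = ([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
comparison = ([6, 7], [12.0, 14.1])

ranking = rank_models({"mean": fit_mean, "line": fit_line}, calibration, comparison)
for name, error in ranking:
    print(name, round(error, 3))
```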

Finally, it remains to document the chosen model and write down the equations that underlie it. All of these steps can be performed by PA platforms.

But the model must be understood by its owner. You have to take ownership of what the tool has proposed. Has it surfaced a spurious correlation à la Nicolas Cage? Or is it an illustration of the parable of the blind men and the elephant (Anekantavada) [4], in which everyone guesses the whole by touching a part (if you touch the leg, the elephant looks like a tree; if you touch the ear, the elephant looks like a fan; if you touch the trunk, the elephant looks like a snake…)?

Does the model over-represent the data-rich aspects but neglect fundamental elements that the file covers poorly? Is the phenomenon stable enough for the results to be exploitable (for example, everyone who buys a yellow jacket owns a car and is risk-averse)?

The fraud detection team must then launch the model on real data. Are the data still available? Will the model never be used out of context? Will the documentation be kept up to date? Do stakeholders understand the model well enough to continue using it? This substantial work is critical for a successful deployment of the tool.

End Note

No, this type of platform isn't a silver bullet for all of your problems, but it can lower your costs and ease the concerns that building a PA team can raise about how much money to invest.