Preparing for a Real-Life Machine Learning Project 3
We will address here some of the questions that we raised at the end of the last post. As a reminder, these questions were:
A. What if a decision tree is not the best choice for an algorithm? What do we do then?
B. How do we know that this model had the highest possible accuracy score? How would we test the other machine learning algorithms?
C. How can I tell if some variables have a more direct relationship with the outcome than others?
Before we answer any of these questions, let's spend some time on data preparation. This step is usually where most of the work happens in a live project.
For this project, I sourced the data from the following link: https://opendata.stackexchange.com/questions/7807/where-can-i-find-automobile-insurance-claims-data-set.
The main steps followed concerning data preparation were as follows:
A. Examining the data
The first thing to note upon downloading the data was that no column labels a customer as an Excellent Driver, a Good Driver, or a Bad Driver.
The creation of the 'label' was a substantial component of this exercise, as you will see from the steps below.
B. Examining the features
There are a total of 17 columns or 'Features' in this data set. As 'Feature Selection' is an essential part of creating an algorithm, we need to examine which of these 17 features are suitable for our analysis and which ones are not.
I did not find any description of the fields online, so I created my own using the names of the fields. A summary of the 'Features' I chose and discarded is below:
Out of 17 features, we ended up including 5 in a CSV file.
C. Calculating the Labels
I then used the following criteria to arrive at the Labels.
Applying this logic resulted in the following distribution of 'Labels':
Preparing the Algorithm
Step 1 – Load the Data
As this is a CSV file, we need the ability to read a CSV file as the very first step. To do this, we import pandas and use the pd.read_csv function, as listed below:
import pandas as pd
file = pd.read_csv(r'C:\personal\ML\insurance-sample.csv')
As we did before, we use the print command to verify that everything is OK with the program. As you can see from the screenshot below, the program does not output the contents of the file, but prints line 2 of the program.
We use the print command to verify if the program can access the contents of the file, and based on the screenshot, it can. We can also see that we have 9,134 rows and six columns.
To verify this, we can use file.shape. As you can see from the screenshot below, file.shape outputs (9134, 6), which confirms the results above.
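Since the screenshots only show the result, here is a minimal sketch of the load-and-check step that can be run without the original file. It reads a tiny in-memory CSV instead of insurance-sample.csv; the column names and the three rows are made up for illustration and are not the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for insurance-sample.csv:
# five feature columns followed by one label column (names are assumptions).
sample_csv = io.StringIO(
    "income,premium,months_since_claim,open_complaints,policies,label\n"
    "56274,69,32,0,1,Good\n"
    "0,94,13,0,8,Bad\n"
    "48767,108,18,0,2,Excellent\n"
)

file = pd.read_csv(sample_csv)

print(file.head())  # first rows, to confirm the contents were read
print(file.shape)   # (rows, columns) -> (3, 6) for this tiny sample
```

With the real file, the same two print calls should show the first rows of the data and the (9134, 6) shape mentioned above.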
Step 2 – Train the Data
Let's load the file into an array now.
An array is a data structure that stores values of the same data type. In Python, this is the main difference between arrays and lists. While Python lists can contain values of different data types, an array (here, the NumPy array returned by file.values) holds all of its values as one common data type.
Once the data loads into an array, it can be split into 'Training' data and 'Testing' data. As we did last time, we will train the model on 80% of the data set, and test it on the remaining 20%.
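To make the list-versus-array difference concrete, here is a small sketch using NumPy (the library behind the arrays that pandas returns). The values are arbitrary and only there to show the type conversion.

```python
import numpy as np

mixed_list = [1, 2.5, 3]         # a list keeps each value's own type
as_array = np.array(mixed_list)  # NumPy converts every value to one common type

print([type(v).__name__ for v in mixed_list])  # ['int', 'float', 'int']
print(as_array.dtype)                          # float64
```

The integers were silently converted to floats so that the whole array shares a single data type.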
To do so, we use the following commands:
array = file.values
X = array[:,0:5]
Y = array[:,5]
Upon running the program, there were no errors reported, as shown in the screenshot below:
We then use the print command to see what the program prints when using X and Y.
Printing X confirms that none of the 'Labels' are being picked up.
Printing 'Y' confirms that none of the 'Features' are selected.
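The slicing can be checked on a toy array before trusting it on the full data. The sketch below assumes the same layout as our CSV (five feature columns, then one label column); the numbers and labels are invented.

```python
import numpy as np

# Hypothetical 3-row array: five feature columns followed by one label column.
array = np.array(
    [
        [56274, 69, 32, 0, 1, "Good"],
        [0, 94, 13, 0, 8, "Bad"],
        [48767, 108, 18, 0, 2, "Excellent"],
    ],
    dtype=object,
)

X = array[:, 0:5]  # every row, columns 0 to 4 (the features)
Y = array[:, 5]    # every row, column 5 only (the label)

print(X.shape)  # (3, 5)
print(Y)        # ['Good' 'Bad' 'Excellent']
```

The end of a slice is exclusive, so 0:5 stops just before the label column.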
To split the data into 'Train' and 'Test' sets, we need to import the train_test_split function from the sklearn library. We will use a fixed 'random state' in our data splitting so that the split is reproducible and every algorithm we compare is trained and tested on exactly the same rows, which avoids bias from one model receiving an easier split than another.
Running the program with the new import does not result in any errors, as evidenced in the screenshot.
At this point, our data is split into Training and Testing data.
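A minimal sketch of the 80/20 split follows. It uses synthetic data so it runs on its own, and the random_state value of 1 and the variable names X_test and Y_test are assumptions rather than the exact ones from the screenshot.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, five features, three possible labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = rng.choice(["Excellent", "Good", "Bad"], size=100)

# Hold out 20% of the rows for testing; random_state fixes which rows those are.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)

# Repeating the call with the same random_state reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, Y, test_size=0.20, random_state=1)
print(np.array_equal(X_train, X_train_again))  # True
```

Changing random_state to another number gives a different, but equally repeatable, split.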
Step 3 – Run Multiple Models
There are six different models that we can build and test:
· Linear Models:
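o Logistic Regression (LR) – Estimates the probability that an observation belongs to a class (for example, pass/fail) from a linear combination of the features.
o Linear Discriminant Analysis (LDA) – Looks for a linear combination of features that best separates two or more classes of objects.
· Non-Linear Models:
o K-Nearest Neighbors (KNN) – A pattern recognition method used for both regression and classification; it classifies an observation by the labels of its closest neighbors.
o Classification and Regression Trees (CART) – Use of 'decision trees' to go from observations to conclusions.
o Gaussian Naive Bayes (NB) – Uses Bayes' theorem to turn prior probabilities into posterior probabilities, assuming the features are independent and normally distributed within each class.
o Support Vector Machines (SVM) – A supervised method that looks for the boundary separating the classes with the widest possible margin.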
To run these six models one after another, we can store them in a list and loop over it.
Add the following to your code:
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
To append these models to a list named models, we must first import them from sklearn. We once again run our code and find no errors. Hence our code should look as follows:
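As a rough sketch, the full list with its imports might look like the block below. Only the LR line appears in the text above, so the settings for the other five models are assumptions (scikit-learn defaults, plus gamma='auto' for SVM). The LR entry here also leaves out multi_class, because recent scikit-learn releases have deprecated that argument.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Each entry pairs a short name with an untrained model object.
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

print([name for name, _ in models])
# ['LR', 'LDA', 'KNN', 'CART', 'NB', 'SVM']
```

Keeping the short name next to each model makes the accuracy printout easier to read later.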
Step 4 - Check Accuracy Score for each Model
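To check the accuracy of each of the six models, we create two lists titled 'results' and 'names'.
results = []
names = []
for name, model in models:
    # Split the training data into 10 stratified folds, shuffled with a fixed seed
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    # Keep the scores and the model name for later comparison
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))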
We will also have to import:
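from sklearn.model_selection import StratifiedKFold (to split the training data into folds that keep the label proportions)
from sklearn.model_selection import cross_val_score (to evaluate a score by cross-validation)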
Upon running the program, we can see the results for the different models we used in the screenshot below:
The two columns in the 'results' output are the mean and the standard deviation of the cross-validation accuracy for each of the six algorithms.
In the example above, the Classification and Regression Trees (CART) model has generated an accuracy rate of 93% with a slightly elevated standard deviation. The CART model would therefore be the best approach in this case.
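To show how the comparison can be done in code rather than by eye, here is a self-contained sketch that runs the cross-validation loop and picks the model with the highest mean accuracy. It uses synthetic data from make_classification and only three of the six models to stay short, so the scores it prints will not match the 93% above; the data settings and the random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data: five features, three classes.
X_train, Y_train = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=1,
)

models = [
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier(random_state=1)),
    ('NB', GaussianNB()),
]

results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# Pick the model whose ten fold scores have the highest mean.
best_name, best_scores = max(zip(names, results), key=lambda pair: pair[1].mean())
print('Best by mean accuracy:', best_name)
```

The mean alone does not tell the whole story: when two models have similar means, the one with the smaller standard deviation is usually the steadier choice.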
In the next post, we will examine how to use strings (most of the other columns in our data sets) in our algorithms and see if that makes a difference as to which algorithm might be the best one to use.