Preparing for a Real-Life Machine Learning Project
In the previous post, I shared my trial of six different machine learning models and measured their accuracy scores.
In that post on the Insurance dataset, I had left out the string (text) data. In this post, we revisit the scores of the models after including the string data, which will help us determine whether the values in these strings are useful to the algorithms.
As always, we follow this series of steps:
Step 1- Load the Data.
Step 2- Check Data Types & Standardize the data types.
Step 3- Split the data into Test & Training sets.
Step 4- Run multiple algorithms and compare the Accuracy Score.
Step 1 – Load the Data
This time, we load the large CSV file which has both strings and integers. We have also kept the same logic based on which we arrived at the different labels of ‘Excellent/Good/Poor’ drivers.
As before, we use the print() command to verify that pd.read_csv() worked successfully. The screenshot below shows that this is the case.
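The load-and-verify step can be sketched as follows. The real file name and its full column list are not shown in this post, so the sketch reads a tiny hypothetical CSV from memory; the 'Income' and 'Driver Rating' columns and all values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the insurance CSV (assumed columns and values).
csv_text = """State Code,Coverage,Income,Driver Rating
KS,basic,56274,Good
NE,extended,0,Poor
OK,premium,48767,Excellent
"""

# pd.read_csv accepts a file path or any file-like object.
file = pd.read_csv(io.StringIO(csv_text))

print(file.shape)   # (rows, columns)
print(file.head())  # first rows, to check that the file parsed as expected
```

With the real data, the io.StringIO(...) argument would simply be replaced by the path to the CSV file.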
We now wish to view the different data types the various ‘features’ are in.
pandas provides a dtypes attribute (file.dtypes) that lets us query the data type of every column in a single line of code. To see its result, we once again rely on the print() command. The output on the screen confirms that the data types in our file are a mix of integers, floats, and objects (strings). Algorithms in scikit-learn expect numeric data, so we need to convert all object data into numeric data.
As this is a common problem in machine learning, we will use a ready-made class, LabelEncoder, whose fit_transform() method converts string data into numeric data.
We implement the LabelEncoder in the following manner:
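As a minimal sketch of the data type check, the snippet below builds a toy frame (the values and the 'Premium' column are hypothetical) and lists the columns that are not numeric and therefore still need converting.

```python
import pandas as pd

# Toy frame with assumed values; only 'Coverage' echoes a column from the post.
file = pd.DataFrame({
    "Coverage": ["basic", "premium", "extended"],
    "Income": [56274, 0, 48767],
    "Premium": [69.0, 94.5, 108.0],
})

# One data type per column.
print(file.dtypes)

# Columns that are not numeric are the ones that need encoding.
string_columns = file.select_dtypes(exclude="number").columns.tolist()
print(string_columns)
```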
# import the necessary module
from sklearn import preprocessing
# create the LabelEncoder object
le = preprocessing.LabelEncoder()
# convert the categorical values into numeric codes
encoded_value = le.fit_transform(["basic", "premium", "extended"])
print(encoded_value)
Please note, this method assigns the numeric values to the classes. The classes are numbered in sorted (alphabetical) order, regardless of the order in which they appear in the original list.
The output on the bottom right of the screen tells us that ‘basic’ transformed into 0, ‘extended’ transformed into 1, and ‘premium’ transformed into 2.
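To inspect which number each class received, the fitted encoder can be queried directly. This is a small sketch using the same three labels as above; the variable names mapping and decoded are introduced here for illustration.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["basic", "premium", "extended"])

# classes_ holds the labels in sorted order; each label's index is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original strings.
decoded = [str(label) for label in le.inverse_transform(encoded_value)]
print(decoded)
```

Keeping the fitted encoder around is useful later, when numeric predictions or feature values have to be read back as text.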
Now, before transforming the different columns, we list the unique entries in each string column using the following code:
print("'State Code' : ",file['State Code'].unique())
print("'Education' : ",file['Education'].unique())
print("'Gender' : ",file['Gender'].unique())
print("'Location Code' : ",file['Location Code'].unique())
print("'Marital Status' : ",file['Marital Status'].unique())
print("'Sales Channel' : ",file['Sales Channel'].unique())
print("'Vehicle Class' : ",file['Vehicle Class'].unique())
print("'Vehicle Size' : ",file['Vehicle Size'].unique())
And upon running the program, we can view the unique classes for each column in the screenshot below:
We now apply the same le.fit_transform() call to the columns of our file and convert all the strings into numeric values. Upon running the program, we observe that there are no errors:
To verify that the conversion has taken place correctly, we use the head() function to examine the first few rows of our data set. To repeat the conversion for all the string columns, we use the following code:
file['State Code'] = le.fit_transform(file['State Code'])
file['Coverage'] = le.fit_transform(file['Coverage'])
file['Education'] = le.fit_transform(file['Education'])
file['Gender'] = le.fit_transform(file['Gender'])
file['Location Code'] = le.fit_transform(file['Location Code'])
file['Marital Status'] = le.fit_transform(file['Marital Status'])
file['Sales Channel'] = le.fit_transform(file['Sales Channel'])
file['Vehicle Class'] = le.fit_transform(file['Vehicle Class'])
file['Vehicle Size'] = le.fit_transform(file['Vehicle Size'])
Upon printing the file (print(file.head())), we can see that all strings got converted to numeric outputs.
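The column-by-column conversion above could also be written as a loop. The sketch below is one possible variant, not the code used in this post: it runs on a toy frame with assumed values and keeps one fitted encoder per column in a hypothetical encoders dictionary, so that each mapping can be looked up later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with assumed values; 'Income' is a hypothetical numeric column.
file = pd.DataFrame({
    "Coverage": ["basic", "premium", "extended", "basic"],
    "Gender": ["F", "M", "F", "F"],
    "Income": [56274, 0, 48767, 21000],
})

encoders = {}
for column in file.select_dtypes(exclude="number").columns:
    encoder = LabelEncoder()                      # a fresh encoder per column
    file[column] = encoder.fit_transform(file[column])
    encoders[column] = encoder                    # keep it to decode later

print(file.head())
print(file.dtypes)
```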
Step 2 – Train the Data
Like we did in the previous exercise, we load the file into an array.
Once the data is loaded into an array, it can be split into ‘Training’ data and ‘Testing’ data. As we did last time, we will train the models on 80% of the data set and test them on the remaining 20%.
To do so, we use the following commands:
array = file.values
X = array[:,0:17]
Y = array[:,17]
Upon running the program, there are no errors reported, as shown in the screenshot below:
To confirm the correct selection of X (features) and Y (labels), we use the print command:
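A minimal sketch of that check, using a stand-in array of the same width (17 feature columns followed by one label column) rather than the real data:

```python
import numpy as np

# Stand-in for file.values: 5 rows, 18 columns (assumed toy numbers).
array = np.arange(5 * 18).reshape(5, 18)

X = array[:, 0:17]  # columns 0 to 16: the features
Y = array[:, 17]    # column 17: the label

print(X.shape)  # rows x 17 features
print(Y.shape)  # one label per row
print(Y[:5])    # first few labels
```

If X.shape reports a number of columns other than 17, or Y contains values that do not look like the driver labels, the slice boundaries need adjusting.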
We re-use the commands that we used last time to split the data into ‘Train’ and ‘Test’ sets. We import the module from the sklearn library. We set a ‘random state’ in the split so that it is reproducible: every algorithm is trained and evaluated on exactly the same rows, which avoids bias in the comparison.
from sklearn.model_selection import train_test_split
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=0.20, random_state=1)
Running the code does not result in any errors, as evidenced in the screenshot.
At this point, our data has been successfully split into Training and Testing data.
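To confirm the proportions of the split, the shapes of the resulting arrays can be printed. The sketch below assumes the 80/20 split described above, which corresponds to test_size=0.20, and uses 100 rows of made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 2 classes.
X = np.arange(200).reshape(100, 2)
Y = np.array([0, 1] * 50)

X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=1
)
print(X_train.shape, X_validation.shape)

# The same random_state reproduces exactly the same split.
X_train_again, _, _, _ = train_test_split(X, Y, test_size=0.20, random_state=1)
print(np.array_equal(X_train, X_train_again))
```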
Step 3 – Run Multiple Models
Again, as we did in the last post, we use the following six models for testing and evaluation:
· Linear Models:
o Logistic Regression (LR) – Models the probability that an observation belongs to a class (for example, pass/fail) as a function of the input variables.
o Linear Discriminant Analysis (LDA) – Looks for a linear combination of features that best separates two or more classes of objects.
· Non-Linear Models:
o K-Nearest Neighbors (KNN) – A pattern recognition method used for both regression and classification; it labels a point according to its nearest neighbors.
o Classification and Regression Trees (CART) – Use of ‘decision trees’ to go from observations to conclusions.
o Gaussian Naive Bayes (NB) – Applies Bayes’ theorem to go from prior to posterior probabilities, assuming that the features are independent.
o Support Vector Machines (SVM) – A supervised method that looks for the boundary separating the classes with the widest margin.
To run these six models one after another, we collect them in a list named models.
Add the following to your code:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# create an empty list to hold the models
models = []
# add the models as (name, estimator) pairs
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
Before appending the models to the list named models, we must import them from sklearn. We run our code once again and find no errors. Our code should now look as follows:
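A sketch of the complete list with all six models is given below. Only the LR entry appears in this post; the other five entries use scikit-learn's default settings, which is an assumption. The LR entry here also uses default settings with a larger max_iter instead of solver='liblinear', multi_class='ovr', because recent scikit-learn releases deprecate the multi_class argument.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Each entry is a (short name, estimator) pair.
models = []
models.append(('LR', LogisticRegression(max_iter=1000)))  # assumed settings
models.append(('LDA', LinearDiscriminantAnalysis()))      # defaults (assumed)
models.append(('KNN', KNeighborsClassifier()))            # defaults (assumed)
models.append(('CART', DecisionTreeClassifier()))         # defaults (assumed)
models.append(('NB', GaussianNB()))                       # defaults (assumed)
models.append(('SVM', SVC()))                             # defaults (assumed)

for name, model in models:
    print(name, type(model).__name__)
```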
Step 4 - Check Accuracy Score for each Model
To check the accuracy of each of the six models, we create two lists titled results and names.
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
We will also have to import the following before the loop:
from sklearn.model_selection import cross_val_score (evaluates a score by cross-validation) and
from sklearn.model_selection import StratifiedKFold (provides train/test indices to split the data into train/test sets)
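Put together, the evaluation loop can be sketched end to end as below. To keep the example self-contained it runs on synthetic data from make_classification and on only two of the six models, so the scores it prints say nothing about the insurance dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data: 17 features, 3 classes (assumed).
X_train, Y_train = make_classification(
    n_samples=200, n_features=17, n_informative=5, n_classes=3, random_state=1
)

models = [
    ('CART', DecisionTreeClassifier(random_state=1)),
    ('NB', GaussianNB()),
]

results = []
names = []
for name, model in models:
    # 10 stratified folds: each fold keeps the class proportions of Y_train.
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
```

Each line of output shows the mean accuracy over the ten folds, followed by the standard deviation in brackets.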
Upon running the program, we can see the results for the different models we used in the screenshot below:
If we compare the results of the previous exercise with these, we find that the accuracy of the predictions has increased now that the algorithms are given more data.
The above table provides evidence that the data points contained in strings, once converted into numeric values, have helped to increase the accuracy of the algorithms.
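The mechanism behind this effect can be illustrated on a toy example. In the sketch below, the data is deliberately constructed so that the label depends only on a string column; the column names, values, and the choice of a decision tree are all assumptions, and the result is an illustration of the idea rather than a measurement on the insurance dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
coverage = rng.choice(["basic", "extended", "premium"], size=300)

toy = pd.DataFrame({
    "Income": rng.integers(0, 100000, size=300),  # numeric noise
    "Coverage": coverage,                         # string column
})
# By construction, the label is fully determined by the string column.
label = pd.Series(coverage).map({"basic": 0, "extended": 1, "premium": 2}).to_numpy()

# Encode the string column so that the model can use it.
toy["Coverage"] = LabelEncoder().fit_transform(toy["Coverage"])

model = DecisionTreeClassifier(random_state=1)
score_numeric_only = cross_val_score(model, toy[["Income"]], label, cv=5).mean()
score_with_strings = cross_val_score(model, toy[["Income", "Coverage"]], label, cv=5).mean()
print(score_numeric_only, score_with_strings)
```

When the information that drives the label lives in a text column, leaving that column out hides it from the model, and encoding it makes it available again.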