Writing an Algorithm with Just Six Lines of Code - Inspired by a Google Developers YouTube Video from 2016
This post is part 1 of a series that will help us get familiar with the basics of Machine Learning.
In our work lives, many of our projects touch Machine Learning directly or indirectly. Apart from a few folks on the team, seldom do others get a chance to work on an algorithm. The very word, algorithm, brings up images of pages and pages of complicated code. The reality, however, is much different.
The model we will code today is a supervised Machine Learning algorithm. This means that we will feed some labeled sample data (referred to as 'training data') to our six-line algorithm, and it will learn from that data to predict the label for new inputs.
As an example, if we were writing an algorithm to predict whether an email is spam or not, the entire process would look like the image below.
The different emails coming in would be the input. The classifier or our algorithm would then classify emails as spam or not spam. The resulting segregation would be the output of the program.
In this walkthrough of the Google video, I will take you through all the steps I took to set up the right development environment and then explain the six lines of the actual algorithm. The use case I have chosen is an algorithm that determines whether a new car insurance applicant is likely to be a Great, Average, or Bad driver. From a business standpoint, of course, a different price can then be offered to different applicants based on the prediction of how good a driver each one is.
However, at the end of the exercise, you can save the algorithm and reuse it for any other use cases.
The steps below are grouped into three major categories:
Step 1 - Setting up the Development/Coding Environment
Step 2 - Writing the Algorithm
Step 3 - Running the Algorithm
Let's get started.
Step 1 - Setting up the Development Environment
We need to install two things:
A. A programming language to write the Machine Learning Code in and
B. A code editor that helps us detect errors easily as we write this code
You can achieve both objectives by visiting https://www.anaconda.com/. Anaconda is an open-source data science platform that provides access to Python and R, two programming languages, and their editors in a convenient package.
Steps to download Anaconda and get ready to write code in the Code Editor:
1. Go to the individual version of Anaconda page - https://www.anaconda.com/products/individual
2. Choose the right version of Anaconda to download.
3. If you are on a Windows machine, you can check whether it is 32-bit or 64-bit by typing 'system information' in the Windows search bar at the bottom left.
4. Double-click the Anaconda installer once you have downloaded it. Choose all the 'recommended options' as you install Anaconda.
5. Open the Anaconda Navigator by using the Windows search bar.
6. Click to launch Spyder, the code editor that you will be using to write the code.
7. At this point, this is what your screen will look like
8. The Code Editor
As you write code in the left-hand panel, Spyder will let you know if you have entered any wrong syntax.
You are now ready to write code for your algorithm.
Step 2 - Writing the Algorithm
At a macro level, writing the algorithm involves three key steps: collecting the training data, training the classifier, and making predictions.
The Training data collection exercise is where the algorithm gets the data from which it can learn. The Training Classifier ingests this Training Data and detects patterns in the data. At this point, the algorithm is ready to make predictions on any input data that it may get.
As a common rule of thumb, about 80% of the available data is used to train the algorithm, and the remaining 20% is held back to test it. We will discuss this in more detail much later in this series of posts.
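To make the 80/20 idea concrete, here is a minimal sketch in plain Python. The helper name `split_train_test`, the fixed seed, and the ten placeholder records are all assumptions made for illustration; they are not part of the six-line algorithm in this post.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into training and test parts."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder records standing in for real applicant data.
records = list(range(10))
train, test = split_train_test(records)
print(len(train), "training records,", len(test), "test records")
```

Shuffling before the cut matters because real data is often sorted in some way, and an unshuffled split could leave the test portion unrepresentative.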
At this point, we need to import scikit-learn, a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms.
On line 1, all you have to write is 'import sklearn'. Don't worry about the alert that the code editor displays in orange. Any time you wish to run your program, click on the green arrow circled below.
To train the algorithm, we need to provide it with some relevant data. Here is an example of what this input data may consist of in our insurance example:
For this small example, we will stick with the four input variables of Zip Code, Age, Gender, and Car Type. On a live project, of course, there may be many other variables that can be relevant and required.
In Machine Learning terminology, the four input variables of Zip Code, Age, Gender, and Car Type are called 'Features'. The desired output of a Great/Average/Bad driver is called a 'Label'.
Let's take a very small sample of 3 records:
Rather than entering actual values for Zip Code, Age, Gender, and Car Type, we will use numeric codes, which makes the algorithm easier to write. Otherwise, we would have to deal with a mix of integers and strings. So let's avoid that by using a numeric code for each feature field.
In our simple example, we will input the numeric codes above as a representative value for each feature.
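As a rough illustration of what "numeric codes" means, the sketch below turns one applicant into a row of four numbers. The code tables and the helper `encode_applicant` are hypothetical: the category names and code values are assumptions chosen for this example, not the ones used in the post's table.

```python
# Hypothetical code tables; the category names and values are assumptions.
GENDER_CODES = {"female": 1, "male": 2}
CAR_TYPE_CODES = {"sedan": 2, "suv": 4, "sports": 6}


def encode_applicant(zip_group, age_group, gender, car_type):
    """Turn one applicant into a list of four numeric feature codes."""
    return [
        zip_group,                 # assumed to be already coded as a number
        age_group,                 # assumed to be already coded as a number
        GENDER_CODES[gender],      # text category -> numeric code
        CAR_TYPE_CODES[car_type],  # text category -> numeric code
    ]


row = encode_applicant(1, 2, "female", "sedan")
print(row)
```

Each encoded row has the same shape as one entry in the features list that the algorithm learns from.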
First, as we are now getting ready to enter important data, it is a good time to save this file with a real name. Use the file's 'Save As' command and save it as 'car-insurance'.
Please make sure that you set the file type to 'Python' when you are saving the file.
Add the features as line 2 of the algorithm.
features = [[1, 2, 1, 2],[1, 5, 1, 4],[3, 3, 2, 6]]
Your screen should now look like the following screenshot:
Add line 3, which holds the 'labels' for the algorithm to learn.
labels = [0,1,2]
Based on the code above, your Excel file/Google Sheet of data should look as follows:
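The same table can be rebuilt from the two lists to check that every feature row has exactly one label. This is only a sketch; the `columns` list is an assumption about the column order (Zip Code, Age, Gender, Car Type), taken from the order in which the features are named in this post.

```python
features = [[1, 2, 1, 2], [1, 5, 1, 4], [3, 3, 2, 6]]
labels = [0, 1, 2]

# Assumed column order, matching the order the features are listed in the text.
columns = ["Zip Code", "Age", "Gender", "Car Type"]

# Every row of features must have exactly one matching label.
assert len(features) == len(labels)

rows = []
for feature_row, label in zip(features, labels):
    record = dict(zip(columns, feature_row))  # pair each value with its column
    record["Label"] = label
    rows.append(record)

for record in rows:
    print(record)
```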
The program now has the input data that it needs. We now need to train the algorithm.
At this point, we need the Decision Tree module from scikit-learn so that the data can be analyzed using decision trees.
A single line of code creates the classifier:
clf = tree.DecisionTreeClassifier()
The word 'clf' refers to the 'classifier'. Your screen should now look like this:
The red X on line 4 suggests an error. If you hover your mouse over the red X you see the code editor being helpful and providing you the reason for the error:
The error relates to 'tree' not being a defined name. Line 1 needs to be revised from 'import sklearn' to 'from sklearn import tree' to resolve the error. As you can see from the screen below, the error has now gone away.
At this point, we are ready to trigger the Training Classifier to ingest the sample data and detect the patterns in it. To do so, we need to add the following line of code:
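A quick way to confirm the fix is to run just the import and the classifier line on their own. This sketch assumes the intended import is `from sklearn import tree` and that scikit-learn is installed (it ships with Anaconda); the `print` line is an addition for checking and is not one of the six lines.

```python
from sklearn import tree  # makes the name 'tree' available to the next line

clf = tree.DecisionTreeClassifier()

# Extra check, not part of the six lines: show what kind of object clf is.
print(type(clf).__name__)
```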
clf = clf.fit(features, labels)
Here, 'fit' is the method that instructs the algorithm to find patterns in the data.
Your algorithm is now ready to accept input data for a sample case and provide a prediction of the kind of driver a prospect is likely to be, based on what it has learned.
Let's assume that a new prospect has the following variables for their Zip Code, Age, Gender, Car Type - 3, 2, 3, 4
To output the prediction for the values above, please input the following line of code:
print(clf.predict([[3, 2, 3, 4]]))
At this point, your screen should look like this:
To run the program, click on the green arrow and view the prediction in the output pane on the right.
Congratulations! Based on the values entered, your algorithm predicts that this new prospect will be a bad driver.
Based on this information, we can show this prospect a higher car insurance quote.
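For reference, the whole program can be sketched in one place, with the numeric prediction translated back into a driver category. Two things here are assumptions rather than part of the original six lines: the `LABEL_NAMES` mapping (0 = Great, 1 = Average, 2 = Bad) and the `random_state=0` argument. With only three records, several splits are equally good, so without a fixed `random_state` the tree, and therefore the prediction, may differ between runs.

```python
from sklearn import tree

features = [[1, 2, 1, 2], [1, 5, 1, 4], [3, 3, 2, 6]]
labels = [0, 1, 2]

# Assumed mapping from label code to driver category.
LABEL_NAMES = {0: "Great", 1: "Average", 2: "Bad"}

# random_state is an addition; it makes repeated runs build the same tree.
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(features, labels)

# predict expects a list of rows, hence the double brackets.
prediction = clf.predict([[3, 2, 3, 4]])
code = int(prediction[0])
print(code, LABEL_NAMES[code])
```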
This algorithm can now be saved and reused for other use cases with some modification. In the next post, we will examine how you can analyze results for thousands and tens of thousands of records. Only when we feed in thousands of records will our algorithm train properly and provide reliable outcomes.