LM101-033: How to Use Linear Machine Learning Software to Make Predictions (Linear Regression Software)[RERUN]
Episode Summary:
In this episode we describe how to download and use free linear machine learning software to make predictions for classifying flower species using a famous machine learning data set. This is a RERUN of Episode 13.
Show Notes:
Hello everyone! Welcome to the thirteenth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.
In this episode we will explain how to download and use free machine learning software which can be downloaded from the website: www.learningmachines101.com. Although we will continue to focus on critical theoretical concepts in machine learning in future episodes, it is always useful to actually experience how these concepts work in practice. For these reasons, from time to time I will include special podcasts like this one which focus on very practical issues associated with downloading and installing machine learning software on your computer. If you follow these instructions, by the end of this episode you will have installed one of the simplest (yet most widely used) machine learning algorithms on your computer. You can then use the software to make virtually any kind of prediction you like. However, some of these predictions will be good predictions, while other predictions will be poor predictions. For this reason, following the discussion in Episode 12 which was concerned with the problem of evaluating generalization performance, we will also discuss how to evaluate what your learning machine has “memorized” and additionally evaluate the ability of your learning machine to “generalize” and make predictions about things that it has never seen before.
We will focus on one of the 298 data sets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) called the Iris data set, which is concerned with the problem of classifying a flower as a member of a flower species based upon physical measurements of the flower. Breast Cancer data sets, Car Evaluation data sets, Heart Disease data sets, Forest Fire data sets, Internet Advertisement data sets, Tennis match data sets, Diabetes data sets, Echocardiogram data sets, Lenses data sets, Ionosphere data sets, Housing data sets, Labor Relations data sets, Molecular Biology data sets, Mushroom data sets, Spam data sets, Space Shuttle O-Ring data sets, and Perfume data sets are just a few of the 298 free data sets available from the UCI Machine Learning Repository. Once you download the software associated with this episode you can explore and evaluate the effectiveness of the linear machine learning software to make predictions on any of these data sets or other data sets that you may obtain through other methods.
A computer programming language is essentially a language or recipe for telling a computer what task it should perform. The computer programming language which I have used to construct these algorithms is called MATLAB and it is sold by the MATHWORKS (www.mathworks.com). Different programming languages have different strengths and weaknesses. Some computer programming languages such as C++ are intended to be more general-purpose languages suitable for a wide variety of tasks. The computer language JAVA is designed to be extremely compatible across a wide range of operating systems. Other computer languages such as PROLOG are designed to assist computer programmers to provide the computer with knowledge in the form of logical rules. Still other computer programming languages such as PERL are useful for text and language processing. The computer programming language MATLAB is designed to provide a simplified approach to specifying complex dynamical systems which might have hundreds, thousands, or even millions of variables. MATLAB is commonly used by engineers and computer scientists when they are developing prototypes in many areas involving smart machines including areas such as: signal processing, image processing, and control systems. There is a free software package similar to MATLAB called OCTAVE (http://www.gnu.org/software/octave) but I have not had the opportunity to check whether the MATLAB code provided on this website will work correctly using OCTAVE. An especially attractive feature of both MATLAB and OCTAVE is that the amount of software required to write a machine learning algorithm is typically quite small relative to the amount of software which would be required to write that same computer program in a programming language such as JAVA, where the program might be much more complicated to develop and maintain. This means that the development and evaluation of complex algorithms is much faster in the MATLAB environment.
In order to run the computer programs on this website, you have several options to consider. First, you can download and install MATLAB on your computer and then download the MATLAB source code. We will not discuss this option in this podcast but this option is available for MATLAB programmers. Second, you can download and install a computer program which will run on your WINDOWS or MAC OS-X operating system. This computer program is essentially the same as the MATLAB program but it has been compiled into executable software. However, if you just download this executable software it will not work on your computer because this software requires access to the vast libraries of mathematical functions in MATLAB. Therefore, before you install either the WINDOWS or MAC OS-X version of the learning machine software on your computer, you need to install the MATLAB mathematical function libraries on your computer using a special computer program called the MCR Installer. We will now explain this procedure step by step. You might want to listen to this podcast once all the way through, and then listen to it again as you follow the various steps.
The first step is to obtain the software download password. This is done by visiting the website: www.learningmachines101.com and joining the Learning Machines 101 community. Type in your email address, name, and interests on the website, enter the CAPTCHA code, and hit submit. In addition to receiving the current password for accessing the latest version of the software, you will receive emails notifying you when the latest podcast episodes have been released. You will typically receive no more than 2 emails per month. Furthermore, we plan in the future to provide additional software, new software updates, special webinars on topics of interest, and special supplemental technical notes. This membership is free and you can cancel your membership at any time.
The second step is to download the appropriate MCR Installer Library by going to the Software menu choice on the Learning Machines 101 website. Once you obtain the software download password using email, you need to decide what software you should download and install. If you have a Windows operating system, then you need to install the MATLAB Compiler Runtime Library for Windows. If you have a MAC OS-X operating system, then you need to install the MATLAB Compiler Runtime Library for MAC OS-X. The MCR Installer is a large 300-400 megabyte file which contains a large collection of mathematical functions which need to be installed on your computer. Allow about 20 minutes to 1 hour to complete the one-time installation procedure. After you have followed the directions and installed the MATLAB Compiler Runtime Library on either your Windows or MAC OS-X computer, do not delete the installer file. You will need the MCR Installer file to uninstall the MCR Library in the future. Note that you may need to use a program such as WINZIP to unzip the folder containing the MCR Installer.
To uninstall the MCR Library, simply click on the MCR Installer file as if you were installing the MCR Library for the first time. The MCR Installer is smart enough to figure out that, since the MCR Library is already on your computer, you probably want to uninstall the software. The MCR Installer will then ask you if you do indeed want to uninstall the MCR Library. Once the MCR Library is uninstalled, you can delete the MCR Installer file from your computer.
As previously mentioned, the installation of the MCR Library is really the time-consuming step. Once this library is installed on your computer, you will not have to reinstall the MCR Library in the future. A software update from Learning Machines 101 is typically a very small file which is no more than a half a megabyte so future software updates should be very fast and easy.
The third step is to download the appropriate software update by going to the Software menu choice on the Learning Machines 101 website. Download either the Windows or the MAC OS-X version of the software. If you already have MATLAB installed on your computer, you might want to also download the MATLAB source code. Even if you don’t have MATLAB installed on your computer, you might want to download the MATLAB source code since it provides relatively easy to read documentation of both the Windows and MAC OS-X software. The size of this file is relatively small (typically less than a half a megabyte). When you unzip the folder containing the software you will find the executable program (which should run on your computer if you have already properly installed the MCR Library). You will also find a file called LICENSE.txt which is the Apache 2.0 software license for using this software. Basically, all software provided by Learning Machines 101 is freely licensed software which can be incorporated into commercial software projects or free software projects. You, however, are responsible for making sure that the origins of the software are properly credited. We are not responsible for any software bugs or problems. Software bugs and problems should be reported using the contact form on the Learning Machines 101 website (www.learningmachines101.com/contact). An explanation of the software license is provided on the Learning Machines 101 website as well (www.learningmachines101.com/license). The MCR Installers are the property of the MathWorks (www.mathworks.com). You are legally allowed to download and use the MCR Installers on your computer provided that you only use the MCR Installers to run software downloaded from the website: www.learningmachines101.com.
Now for the more interesting stuff. Inside the folder you will see another text file called irisDOC.txt which describes the contents of the data files “testdata.xls” and “trainingdata.xls”. These data files are spreadsheets. Open up the file “testdata.xls” and look at the contents. You will see a spreadsheet which has seven columns of data. Each of the seven columns has a label. The labels of the first two columns are “Setosa” and “Versicolour”, and together these two columns are used to specify three distinct species of the Iris flower: Setosa, Versicolour, and Virginica. Each row of numbers corresponds to the characteristics of a particular Iris flower. So, for example, the first two numbers in the first row of the file “testdata.xls” are a 0 under the category label “Setosa” and a 0 under the category label “Versicolour”. This pattern of two zeros indicates that the flower belongs to the category Iris Virginica. The second row of numbers has a 0 under the category “Setosa” and a 1 under the category “Versicolour”, indicating that the Iris flower represented by the numbers in the second row belongs to the flower category “Iris Versicolour”. The third row of numbers has a 1 under the category “Setosa” and a 0 under the category “Versicolour”, indicating that the flower specified by the third row of numbers is from the species “Iris Setosa”.
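As a rough illustration of this two-column coding scheme, here is a small sketch in Python. It is not part of the downloadable software, and the function name decode_species is a hypothetical name chosen just for this example.

```python
# Hypothetical helper (not from the Learning Machines 101 software):
# decode the "Setosa" and "Versicolour" indicator columns into a species name.
SPECIES_CODES = {
    (1, 0): "Iris Setosa",
    (0, 1): "Iris Versicolour",
    (0, 0): "Iris Virginica",  # neither indicator is set
}


def decode_species(setosa, versicolour):
    """Return the species name for one row's two indicator values."""
    return SPECIES_CODES[(setosa, versicolour)]


# The first three rows of "testdata.xls" as described in the text above.
for pair in [(0, 0), (0, 1), (1, 0)]:
    print(pair, "->", decode_species(*pair))
```

Two indicator columns are enough for three species because the third species is signaled by both indicators being 0.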
In addition, each flower in the database corresponding to each row of numbers in the spreadsheet is characterized by the length and width of its petals as well as the length and width of its sepals. The sepal of a flower is usually a special type of petal or leaf used to protect the flower. The flower represented by the first row of numbers is a member of the species “Iris Virginica” and has a sepal length of 7.7 centimeters, a sepal width of 2.8 centimeters, a petal length of 6.7 centimeters, and a petal width of 2 centimeters. The last column is labeled “Intercept” and the purpose of this column will be explained later.
The prediction problem associated with this data set is to use the features “sepal length”, “sepal width”, “petal length”, and “petal width” in order to predict whether the flower belongs to the species: “Iris Setosa”, “Iris Versicolour”, or “Iris Virginica”. We consider a special type of prediction machine which we will call the “linear machine”. The linear machine has some variables which are called parameters. Different choices of these parameters lead to different predictions. Specifically, we will generate a number which specifies the “evidence supporting” the hypothesis that the flower species is “Iris Setosa” and we will generate another number which specifies the “evidence supporting” the hypothesis that the flower species is “Iris Versicolour” given that we know specific measurable features of the flower such as the length and width of its petals and the length and width of its sepals.
In particular, the evidence supporting the hypothesis that the flower species is “Iris Setosa” is defined as a weighted sum of the petal and sepal length and width in centimeters plus an additional free parameter called the Intercept. The relative weightings of the petal and sepal length and width measurements, together with the Intercept, are free parameters: numbers which are adjusted based upon the linear machine’s experiences with the world. The evidence supporting the hypothesis that the flower is from the species “Iris Versicolour” is computed in a similar manner using its own set of weights and its own Intercept. The Intercept parameter essentially tells the learning machine that it should first learn how frequently each of the three flower species will occur and it should use the measurements of the petal and sepal length and width to distinguish species on the basis of their deviation from the likelihood that a particular species will occur.
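To make the idea of a weighted sum plus an Intercept concrete, here is a minimal sketch in Python. The weights and the Intercept below are made-up numbers chosen only for illustration; they are not the parameter values the software would learn, and the function name evidence is a hypothetical name for this example.

```python
def evidence(features, weights, intercept):
    """Weighted sum of the measurements plus the intercept parameter."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


# Measurements of the first flower described above, in centimeters:
# sepal length, sepal width, petal length, petal width.
flower = [7.7, 2.8, 6.7, 2.0]

# Made-up (hypothetical) parameters for the "Setosa" hypothesis.
setosa_weights = [0.1, 0.2, -0.2, -0.1]
setosa_intercept = 0.1

print("Evidence for Iris Setosa:", evidence(flower, setosa_weights, setosa_intercept))
```

A second set of weights and a second Intercept would be used in exactly the same way to compute the evidence for “Iris Versicolour”.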
The linear learning machine uses the data from the “training data set” to figure out the choice of parameter values. That is, it uses the data from the training data set to figure out how much each petal and sepal length and width measurement should be weighted before they are added together to generate a number representing the evidence for a particular flower species. How is this done?
This is done by using a mathematical analysis to choose the weighting parameters so as to minimize the learning machine’s average squared prediction error. In the show notes for this episode on the website: www.learningmachines101.com you will find a technical memo giving the mathematical details of this procedure, which is based upon a formula called least-squares parameter estimation using a pseudo-inverse methodology. The parameter estimation procedure or learning rule is essentially equivalent to a statistical analysis method called linear regression.
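For listeners who like to see things in code, here is a rough sketch of least-squares parameter estimation in plain Python. It is not the MATLAB code from the website. It solves the so-called normal equations by Gaussian elimination, which gives the same answer as the pseudo-inverse formula when no predictor column is an exact linear combination of the others. The function names and the tiny data set are hypothetical and made up for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Choose weights minimizing the average squared prediction error.

    Each row lists the predictor values for one example. A trailing 1.0
    plays the role of the Intercept column (an assumption of this sketch).
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up toy data: one measurement plus the constant 1.0 for the intercept.
toy_rows = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
toy_targets = [1.0, 3.0, 5.0]
print(fit_least_squares(toy_rows, toy_targets))  # weight and intercept
```

With two targets such as “Setosa” and “Versicolour”, one would run the fit once per target column to get one set of weights for each.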
The fourth step is to run the linear learning machine software on your computer. In the folder you downloaded, we have two different spreadsheets labeled “trainingdata.xls” and “testdata.xls”. We also assume that if you are running the Windows or MAC OS-X version of the software, you have already installed the MCR Library on your computer using the MCR Installer. If you are running the MATLAB source code, then it is assumed that you have installed the computer program MATLAB on your computer. Finally, if you have opened up either of the spreadsheets “testdata.xls” or “trainingdata.xls”, make sure these files are closed before you proceed.
Run the program “golinear.exe” on a Windows machine by clicking on that program. The first time you click on it, it may take about 90 seconds for the program to be activated. That is because the first time you run the program it is compiled using the MCR Library. The second time you run the program, the program will be activated within only about 5 or 10 seconds since the program has already been compiled.
After you execute the program “golinear.exe”, a splash screen appears which reviews the licensing agreement. Click ok and then a dialog box will be displayed asking you to identify which variables are the “targets”. The linear learning machine is designed to compute evidence supporting the presence or absence of the targets. If you select the targets “Setosa” and “Versicolour” then the linear learning machine will learn to compute the evidence for the presence of the targets “Setosa” and “Versicolour” given the petal and sepal length and width measurements in centimeters. Note that in order to select multiple targets you need to hold down the CONTROL key when you make your selection on a WINDOWS operating system. A splash screen confirming your choice of the target variables then appears.
The remaining variables which are presumed to be observable measurements on the flower are called the predictor variables. A splash screen confirming your choice of the predictor variables then appears.
The program then asks you to identify the spreadsheet file which contains the “test data”. It is required that the “test data” file has exactly the same number and type of columns as the “training data” file.
Finally, the program uses the training data to learn the parameters of the linear learning machine and then uses that learning machine with those estimated parameter values to generate the average squared prediction error for the training data and the test data sets. In addition, the output of the program displays the percentage classification error of the linear learning machine. This is computed by defining the learning machine’s prediction that the target variable takes on the value of 1 as the situation where the evidence for that target value is greater than ½. When the evidence for the target value is less than or equal to ½, the learning machine is assumed to predict that the target variable takes on the value of 0. Thus, for each flower, the linear learning machine calculates the evidence supporting the presence of a flower species as a particular number using parameters estimated from the training data. If this evidence is greater than ½, the machine’s prediction that the flower belongs to that flower species is equal to 1; otherwise it is equal to 0. Using this method, the percentage of classification errors may be recorded.
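The two performance measures just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the evidence values and targets below are made up, the function names are hypothetical, and the actual program may combine errors across the two target variables in a way not shown here.

```python
def predict_label(evidence_value):
    """Predict 1 when the evidence is greater than 1/2, otherwise predict 0."""
    return 1 if evidence_value > 0.5 else 0


def average_squared_error(evidences, targets):
    """Average of the squared differences between evidence and target."""
    return sum((e - t) ** 2 for e, t in zip(evidences, targets)) / len(targets)


def percent_classification_error(evidences, targets):
    """Percentage of examples whose thresholded prediction is wrong."""
    wrong = sum(1 for e, t in zip(evidences, targets) if predict_label(e) != t)
    return 100.0 * wrong / len(targets)


# Made-up evidence values for four flowers and their true 0/1 target values.
made_up_evidence = [0.9, 0.5, 0.2, 0.7]
made_up_targets = [1, 1, 0, 0]

print("Average squared error:", average_squared_error(made_up_evidence, made_up_targets))
print("Classification error (%):", percent_classification_error(made_up_evidence, made_up_targets))
```

Notice that an evidence value of exactly ½ is treated as a prediction of 0, following the rule stated above.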
The output display shows that the average squared error on the training data is 0.08 and the average percentage classification error on the training data is about 13.3%, while the average squared error on the test data is 0.11 and the average percentage classification error on the test data is about 14.7%. This is roughly what we should expect. On the average, the performance of the linear learning machine on the test data using the average squared error performance measure should be a little worse than its performance on the training data using the same measure. However, it is possible for the linear learning machine to sometimes have a smaller classification error on the test data relative to the training data.
We can obtain an improved measure of classification error performance by using a 2-fold cross-validation method. We simply rerun the analysis but use the “training data set” as the test data and use the “test data set” as the training data. When we rerun the analysis, the average percentage classification error on the training data (which is now “testdata.xls”) is equal to 14% and the average percentage classification error on the test data (which is now “trainingdata.xls”) is equal to 10%. Averaging 14% and 13.3% gives 13.65% for the 2-fold cross-validation training data classification error. Averaging 14.7% and 10% gives 12.35% for the 2-fold cross-validation test data classification error.
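The averaging step in 2-fold cross-validation is simple enough to check with a few lines of Python. The percentages are the ones reported above for the two runs; the function name two_fold_average is a hypothetical name used only for this sketch.

```python
def two_fold_average(error_first_run, error_second_run):
    """Average an error measure over the two cross-validation runs."""
    return (error_first_run + error_second_run) / 2.0


# Training data classification errors (in percent) from the two runs.
training_cv_error = two_fold_average(13.3, 14.0)

# Test data classification errors (in percent) from the two runs.
test_cv_error = two_fold_average(14.7, 10.0)

print("2-fold training error (%):", round(training_cv_error, 2))
print("2-fold test error (%):", round(test_cv_error, 2))
```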
Note that the average squared error is reported because the linear learning machine uses this performance measure to estimate its parameters during the learning process. When making predictions about target variables which only take on the values of 0 or 1, more effective learning machines specifically designed to deal with such variables should be used. These alternative learning machines, which include logistic regression models and support vector machines, will be discussed in future episodes.
Also note that this Iris data set is a famous problem because it can be shown that no linear machine can learn this data set perfectly with 0% classification error. In future episodes we will show how we can construct nonlinear learning machines that can, indeed, perfectly learn this data set with 0% classification error.
Further Reading:
Least Squares Wikipedia article (http://en.wikipedia.org/wiki/Least_squares)
Cross Validation Wikipedia article (http://en.wikipedia.org/wiki/Cross-validation_(statistics))
UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/)
Copyright © 2014-2015 by Richard M. Golden. All rights reserved.