The aim of supervised machine learning is to build a model that makes predictions based on evidence in the presence of uncertainty. As adaptive algorithms identify patterns in data, a computer "learns" from the observations. When exposed to more observations, the computer improves its predictive performance.
Specifically, a supervised learning algorithm takes a known set of input data and known responses to the data (output), and trains a model to generate reasonable predictions for the response to new data.
For example, suppose you want to predict whether someone will have a heart attack within a year. You have a set of data on previous patients, including age, weight, height, blood pressure, etc. You know whether the previous patients had heart attacks within a year of their measurements. So, the problem is combining all the existing data into a model that can predict whether a new person will have a heart attack within a year.
You can think of the entire set of input data as a heterogeneous matrix. Rows of the matrix are called observations, examples, or instances, and each contains a set of measurements for a subject (patients in the example). Columns of the matrix are called predictors, attributes, or features, and each is a variable representing a measurement taken on every subject (age, weight, height, etc. in the example). You can think of the response data as a column vector where each row contains the output of the corresponding observation in the input data (whether the patient had a heart attack). To fit or train a supervised learning model, choose an appropriate algorithm, and then pass the input and response data to it.
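For instance, here is a minimal sketch of that arrangement in MATLAB. The variable names and measurement values are hypothetical, and fitctree is one of the fitting functions listed later in this topic:
% Each row is one observation (a patient); each column is one predictor:
% age, weight, height, systolic blood pressure
X = [48  85 175 140;
     61  72 168 150;
     35  90 180 130];
% Response vector: whether each patient had a heart attack within a year
Y = [true; false; false];
% Choose an algorithm and pass the input and response data to its fitting function
Mdl = fitctree(X,Y);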
Supervised learning splits into two broad categories: classification and regression.
In classification, the goal is to assign a class (or label) from a finite set of classes to an observation. That is, responses are categorical variables. Applications include spam filters, advertisement recommendation systems, and image and speech recognition. Predicting whether a patient will have a heart attack within a year is a classification problem, and the possible classes are true and false. Classification algorithms usually apply to nominal response values. However, some algorithms can accommodate ordinal classes (see fitcecoc).
In regression, the goal is to predict a continuous measurement for an observation. That is, the response variables are real numbers. Applications include forecasting stock prices, energy consumption, and disease incidence.
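The heart-attack example above is a classification problem. As a regression counterpart, here is a minimal sketch using the carsmall sample data set that ships with the product (fitrtree is one of the fitting functions listed later in this topic):
load carsmall                    % loads Horsepower, Weight, and MPG, among other variables
X = [Horsepower Weight];         % numeric predictors
Y = MPG;                         % continuous response, so this is a regression problem
regMdl = fitrtree(X,Y);          % regression tree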
Statistics and Machine Learning Toolbox™ supervised learning functionalities comprise a streamlined, object framework. You can efficiently train a variety of algorithms, combine models into an ensemble, assess model performance, cross-validate, and predict responses for new data.
While there are many Statistics and Machine Learning Toolbox algorithms for supervised learning, most use the same basic workflow for obtaining a predictive model. (Detailed instructions on the steps for ensemble learning are in Framework for Ensemble Learning.) The steps for supervised learning are:
All supervised learning methods start with an input data matrix, usually called X here. Each row of X represents one observation. Each column of X represents one variable, or predictor. Represent missing entries with NaN values in X. Statistics and Machine Learning Toolbox supervised learning algorithms can handle NaN values, either by ignoring them or by ignoring any row with a NaN value.
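For example, a small sketch of an input matrix with a missing entry (the values are hypothetical):
% Three observations, two predictors; the second observation has a missing value
X = [2.5 1.1;
     NaN 0.7;
     3.9 2.2];
% Find rows that contain any missing predictor value
rowsWithMissing = any(isnan(X),2);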
You can use various data types for response data Y. Each element in Y represents the response to the corresponding row of X. Observations with missing Y data are ignored.
For regression, Y must be a numeric vector with the same number of elements as the number of rows of X.
For classification, Y can be any of these data types. This table also contains the method of including missing entries.
Data Type | Missing Entry |
---|---|
Numeric vector | NaN |
Categorical vector | <undefined> |
Character array | Row of spaces |
String array | <missing> or "" |
Cell array of character vectors | '' |
Logical vector | (Cannot represent) |
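For example, the same three class labels for the heart-attack problem could be stored in several of these types. A small sketch with hypothetical values:
% Equivalent ways to represent the same three class labels
Ylogical = [true; false; false];                   % logical vector
Ycell    = {'true'; 'false'; 'false'};             % cell array of character vectors
Ycat     = categorical({'true';'false';'false'});  % categorical vector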
There are tradeoffs between several characteristics of algorithms, such as:
Speed of training
Memory usage
Predictive accuracy on new data
Transparency or interpretability, meaning how easily you can understand the reasons an algorithm makes its predictions
Details of the algorithms appear in Characteristics of Classification Algorithms. More detail about ensemble algorithms is in Choose an Applicable Ensemble Aggregation Method.
The fitting function you use depends on the algorithm you choose.
Algorithm | Fitting Function |
---|---|
Classification Trees | fitctree |
Regression Trees | fitrtree |
Discriminant Analysis (classification) | fitcdiscr |
k-Nearest Neighbors (classification) | fitcknn |
Naive Bayes (classification) | fitcnb |
Support Vector Machines (SVM) for classification | fitcsvm |
SVM for regression | fitrsvm |
Multiclass models for SVM or other classifiers | fitcecoc |
Classification Ensembles | fitcensemble |
Regression Ensembles | fitrensemble |
Classification or Regression Tree Ensembles (e.g., Random Forests [1]) in Parallel | TreeBagger |
For a comparison of these algorithms, see Characteristics of Classification Algorithms.
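For example, a minimal sketch of training two different classifiers on the fisheriris sample data set that ships with the product:
load fisheriris                    % meas: 150-by-4 numeric predictors; species: 150-by-1 cell array of labels
daMdl  = fitcdiscr(meas,species);  % discriminant analysis classifier
knnMdl = fitcknn(meas,species);    % k-nearest neighbor classifier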
The three main methods to examine the accuracy of the resulting fitted model are:
Examine the resubstitution error. For examples, see:
Examine the cross-validation error. For examples, see:
Examine the out-of-bag error for bagged decision trees. For examples, see:
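A sketch of the three checks, using a classification tree trained on the fisheriris sample data set (the TreeBagger call is needed only for the out-of-bag error):
load fisheriris
treeMdl = fitctree(meas,species);
% Resubstitution error: loss measured on the same data used for training
resubErr = resubLoss(treeMdl);
% Cross-validation error: 10-fold by default
cvMdl = crossval(treeMdl);
cvErr  = kfoldLoss(cvMdl);
% Out-of-bag error for bagged decision trees (returns the error as a function of the number of trees)
bagMdl = TreeBagger(50,meas,species,'OOBPrediction','on');
oobErr = oobError(bagMdl);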
After validating the model, you might want to change it for better accuracy, better speed, or to use less memory.
Change fitting parameters to try to get a more accurate model. For examples, see:
Change fitting parameters to try to get a smaller model. This sometimes gives a model with more accuracy. For examples, see:
Try a different algorithm. For applicable choices, see:
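For instance, a sketch of changing a fitting parameter for a classification tree; the MinLeafSize value shown is an arbitrary choice for illustration:
load fisheriris
% Default tree
treeMdl = fitctree(meas,species);
% A smaller, less flexible tree: require at least 10 observations per leaf
smallTree = fitctree(meas,species,'MinLeafSize',10);
% Compare cross-validation errors to judge the tradeoff
err1 = kfoldLoss(crossval(treeMdl));
err2 = kfoldLoss(crossval(smallTree));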
When satisfied with a model of some types, you can trim it using the appropriate compact function (compact for classification trees, compact for regression trees, compact for discriminant analysis, compact for naive Bayes, compact for SVM, compact for ECOC models, compact for classification ensembles, and compact for regression ensembles). compact removes training data and other properties not required for prediction, for example, pruning information for decision trees, from the model to reduce memory consumption. Because kNN classification models require all of the training data to predict labels, you cannot reduce the size of a ClassificationKNN model.
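For example, a sketch of trimming a trained SVM classifier, using the ionosphere sample data set that ships with the product (the whos call only reports memory use for comparison):
load ionosphere                 % loads X (351-by-34 numeric) and Y (351-by-1 cell array of labels)
svmMdl = fitcsvm(X,Y);
compactMdl = compact(svmMdl);   % drops training data and other properties not needed for prediction
whos svmMdl compactMdl          % compare the memory use of the full and compact models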
To predict classification or regression response for most fitted models, use the predict method:
Ypredicted = predict(obj,Xnew)
obj is the fitted model or fitted compact model.
Xnew is the new input data.
Ypredicted is the predicted response, either classification or regression.
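For example, continuing the SVM sketch above, you can predict labels for new observations. Here the "new" data is simply the first few training rows, for illustration:
Xnew = X(1:5,:);                        % pretend these are new observations
Ypredicted = predict(compactMdl,Xnew);  % works with the full or compact model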
This table shows typical characteristics of the various supervised learning algorithms. The characteristics in any particular case can vary from the listed ones. Use the table as a guide for your initial choice of algorithms. Decide on the tradeoff you want in speed, memory usage, flexibility, and interpretability.
Try a decision tree or discriminant first, because these classifiers are fast and easy to interpret. If the models are not accurate enough at predicting the response, try other classifiers with higher flexibility.
To control flexibility, see the details for each classifier type. To avoid overfitting, look for a model of lower flexibility that provides sufficient accuracy.
Classifier | Multiclass Support | Categorical Predictor Support | Prediction Speed | Memory Usage | Interpretability |
---|---|---|---|---|---|
Decision Trees — fitctree | Yes | Yes | Fast | Small | Easy |
Discriminant analysis — fitcdiscr | Yes | No | Fast | Small for linear, large for quadratic | Easy |
SVM — fitcsvm | No. Combine multiple binary SVM classifiers using fitcecoc. | Yes | Medium for linear. Slow for others. | Medium for linear. All others: medium for multiclass, large for binary. | Easy for linear SVM. Hard for all other kernel types. |
Naive Bayes — fitcnb | Yes | Yes | Medium for simple distributions. Slow for kernel distributions or high-dimensional data. | Small for simple distributions. Medium for kernel distributions or high-dimensional data. | Easy |
Nearest neighbor — fitcknn | Yes | Yes | Slow for cubic. Medium for others. | Medium | Hard |
Ensembles — fitcensemble and fitrensemble | Yes | Yes | Fast to medium, depending on choice of algorithm | Low to high, depending on choice of algorithm | Hard |
The results in this table are based on an analysis of many data sets. The data sets in the study have up to 7000 observations, 80 predictors, and 50 classes. This list defines the terms in the table.
Speed:
Fast — 0.01 second
Medium — 1 second
Slow — 100 seconds
Memory:
Small — 1 MB
Medium — 4 MB
Large — 100 MB
The table provides a general guide. Your results depend on your data and the speed of your machine.
This table describes the data-type support of predictors for each classifier.
Classifier | All predictors numeric | All predictors categorical | Some categorical, some numeric |
---|---|---|---|
Decision Trees | Yes | Yes | Yes |
Discriminant Analysis | Yes | No | No |
SVM | Yes | Yes | Yes |
Naive Bayes | Yes | Yes | Yes |
Nearest Neighbor | Euclidean distance only | Hamming distance only | No |
Ensembles | Yes | Yes, except subspace ensembles of discriminant analysis classifiers | Yes, except subspace ensembles |
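When predictors mix numeric and categorical variables, one option is to store them in a table and pass the table to a fitting function that supports mixed types. A small sketch with hypothetical data:
% A table with one numeric predictor, one categorical predictor, and the response
Tbl = table([48;61;35], categorical({'smoker';'nonsmoker';'smoker'}), ...
    [true;false;false], 'VariableNames',{'Age','SmokingStatus','HeartAttack'});
treeMdl = fitctree(Tbl,'HeartAttack');   % decision trees support mixed predictor types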
[1] Breiman, L. Random Forests. Machine Learning 45, 2001, pp. 5–32.