30 Machine learning in practice
Now that we have learned several methods and explored them with simple, yet illustrative, examples, we will try them out on a real example: the MNIST digits.
We can load this data using the dslabs package:
library(dslabs)
mnist <- read_mnist()
The dataset includes two components, a training set and test set:
names(mnist)
#> [1] "train" "test"
Each of these components includes a matrix with features in the columns:
dim(mnist$train$images)
#> [1] 60000 784
and a vector with the classes as integers.
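We can check this as follows (a quick sketch; labels is the component name returned by read_mnist()):
class(mnist$train$labels)
table(mnist$train$labels)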
Because we want this example to run on a small laptop and in less than one hour, we will consider a subset of the dataset. We will sample 10,000 random rows from the training set and 1,000 random rows from the test set:
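# One way to take this subsample (a sketch): the object names x, y, x_test, and y_test
# are the ones used in the rest of the chapter, and the seed value is arbitrary.
set.seed(1990)
index <- sample(nrow(mnist$train$images), 10000)
x <- mnist$train$images[index, ]
y <- factor(mnist$train$labels[index])
index <- sample(nrow(mnist$test$images), 1000)
x_test <- mnist$test$images[index, ]
y_test <- factor(mnist$test$labels[index])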
30.1 The caret package
We have already learned about several machine learning algorithms. Many of these algorithms are implemented in R. However, they are distributed via different packages, developed by different authors, and often use different syntax. The caret package tries to consolidate these differences and provide consistency. It currently includes over 200 different methods, which are summarized in the caret package manual. Keep in mind that caret does not install the needed packages for you: to use a method through caret, you still need to install the corresponding package. The required packages for each method are described in the manual. The caret package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this incredibly helpful package. We will use the 2 or 7 example to illustrate, and in later sections we use the package to run algorithms on the larger MNIST dataset.
30.1.1 The train function
The caret train function lets us train different algorithms using similar syntax. So, for example, we can type:
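# A sketch, fitting logistic regression and kNN to the 2 or 7 data with the same syntax
# (the object names train_glm and train_knn match those used below):
library(caret)
train_glm <- train(y ~ ., method = "glm", data = mnist_27$train)
train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)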
To make predictions, we can use the output of this function directly without needing to look at the specifics of predict.glm and predict.knn. Instead, we can learn how to obtain predictions from predict.train.
The code looks the same for both methods:
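# A sketch (type = "raw" asks predict.train for predicted classes):
y_hat_glm <- predict(train_glm, mnist_27$test, type = "raw")
y_hat_knn <- predict(train_knn, mnist_27$test, type = "raw")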
This permits us to quickly compare the algorithms. For example, we can compare the accuracy like this:
fits <- list(glm = y_hat_glm, knn = y_hat_knn)
sapply(fits, function(fit) confusionMatrix(fit, mnist_27$test$y)$overall[["Accuracy"]])
#> glm knn
#> 0.75 0.84
30.1.2 Cross validation
When an algorithm includes a tuning parameter, train automatically uses cross validation to decide among a few default values. To find out what parameter or parameters are optimized, you can read the manual or study the output of:
getModelInfo("knn")
We can also use a quick lookup like this:
modelLookup("knn")
If we run it with default values:
train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)
we can quickly see the results of the cross validation using the ggplot function. The argument highlight highlights the max:
ggplot(train_knn, highlight = TRUE)
By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. For the kNN method, the default is to try \(k=5,7,9\). We change this using the tuneGrid parameter. The grid of values must be supplied by a data frame with the parameter names as specified in the modelLookup output.
Here, we present an example where we try out 30 values between 9 and 67. To do this with caret, we need to define a column named k, so we use this: data.frame(k = seq(9, 67, 2)).
Note that when running this code, we are fitting 30 versions of kNN to 25 bootstrapped samples. Since we are fitting \(30 \times 25 = 750\) kNN models, running this code will take several seconds. We set the seed because cross validation is a random procedure and we want to make sure the result here is reproducible.
set.seed(2008)
train_knn <- train(y ~ ., method = "knn",
                   data = mnist_27$train,
                   tuneGrid = data.frame(k = seq(9, 67, 2)))
ggplot(train_knn, highlight = TRUE)
To access the parameter that maximized the accuracy, you can use this:
train_knn$bestTune
#> k
#> 10 27
and the best performing model like this:
train_knn$finalModel
#> 27-nearest neighbor model
#> Training set outcome distribution:
#>
#> 2 7
#> 379 421
The function predict will use this best performing model. Here is the accuracy of the best model when applied to the test set, which we have not used at all yet because the cross validation was done on the training set:
confusionMatrix(predict(train_knn, mnist_27$test, type = "raw"),
mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#> 0.835
If we want to change how we perform cross validation, we can use the trainControl function. We can make the code above go a bit faster by using, for example, 10-fold cross validation. This means we have 10 validation samples, each using 10% of the observations. We accomplish this using the following code:
control <- trainControl(method = "cv", number = 10, p = .9)
train_knn_cv <- train(y ~ ., method = "knn",
                      data = mnist_27$train,
                      tuneGrid = data.frame(k = seq(9, 67, 2)),
                      trControl = control)
ggplot(train_knn_cv, highlight = TRUE)
We notice that the accuracy estimates are more variable, which is expected since we changed the number of samples used to estimate accuracy.
Note that the results component of the train output includes several summary statistics related to the variability of the cross validation estimates:
names(train_knn$results)
#> [1] "k" "Accuracy" "Kappa" "AccuracySD" "KappaSD"
We have only covered the basics. The caret package manual includes many more details.
30.2 Preprocessing
In machine learning, we often transform predictors before running the machine learning algorithm. We also remove predictors that are clearly not useful. We call these steps preprocessing.
Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few non-unique values or close to zero variation. We show an example below.
We can run the nearZeroVar function from the caret package to see that several features do not vary much from observation to observation. We can see that there is a large number of features with 0 variability:
library(matrixStats)
sds <- colSds(x)
hist(sds, breaks = 256)
This is expected because there are parts of the image that rarely contain writing (dark pixels).
The caret package includes a function that recommends features to be removed due to near zero variance:
nzv <- nearZeroVar(x)
We can see the columns recommended for removal:
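# One way to visualize the flagged columns on the 28 x 28 pixel grid (a sketch):
image(matrix(as.numeric(1:ncol(x) %in% nzv), 28, 28))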
So we end up keeping this number of columns:
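# Keep only the columns not flagged by nearZeroVar; col_index is used in the code below
col_index <- setdiff(1:ncol(x), nzv)
length(col_index)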
Now we are ready to fit some models. Before we start, we need to add column names to the feature matrices as these are required by caret:
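# Any unique names work; a simple choice is the column numbers
colnames(x) <- 1:ncol(x)
colnames(x_test) <- colnames(x)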
30.3 k-nearest neighbors
Before starting this section, be warned that the first two calls to the train function in the code below can take several hours to run. This is a common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameters being considered. In the next section, we will provide some suggestions on how to predict the duration of the process and ways to reduce it.
The first step is to optimize for \(k\).
train_knn <- train(x[, col_index], y,
                   method = "knn",
                   tuneGrid = data.frame(k = seq(3, 13, 2)))
Once we optimize our algorithm, we can fit it to the entire dataset:
fit_knn <- knn3(x[, col_index], y, k = train_knn$bestTune)
We achieve a high accuracy:
y_hat_knn <- predict(fit_knn, x_test[, col_index], type = "class")
confusionMatrix(y_hat_knn, factor(y_test))$overall["Accuracy"]
#> Accuracy
#> 0.945
An alternative to removing low variance columns directly is to use dimension reduction on the feature matrix before applying the algorithms. It is important that we not use the test set when finding the PCs, nor any summary of the data, as this could result in overtraining. So we start by applying prcomp to the training data:
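# A sketch: the principal components are computed from the training features only.
# We also save the training-set column means, which are needed to center the test set later.
col_means <- colMeans(x)
pca <- prcomp(x)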
Next, we run kNN on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the variability in the data.
k <- 36
x_train <- pca$x[,1:k]
train_knn <- train(x_train, y,
                   method = "knn",
                   tuneGrid = data.frame(k = seq(3, 13, 2)))
fit <- knn3(x_train, y, k = train_knn$bestTune)
Now we apply the transformation we learned with the training data to the test data, reduce the dimension, and then run predict. Note that we used the rotation and column means estimated from the training data.
y_hat <- predict(fit, sweep(x_test, 2, col_means) %*% pca$rotation[,1:k], type = "class")
confusionMatrix(y_hat, factor(y_test))$overall["Accuracy"]
#> Accuracy
#> 0.962
With this we obtain an improvement in accuracy, while using only 36 dimensions.
30.4 Random Forest
With the random forest algorithm several parameters can be optimized, but the main one is mtry, the number of predictors that are randomly selected for each tree. This is also the only tuning parameter that the caret function train permits when using the default implementation from the randomForest package.
library(randomForest)
grid <- data.frame(mtry = seq(3, 24, 3))
train_rf <- train(x[, col_index], y,
                  method = "rf",
                  tuneGrid = grid)
Now that we have optimized our algorithm, we are ready to fit our final model:
fit_rf <- randomForest(x[, col_index], y, mtry = train_rf$bestTune$mtry)
As with kNN, we also achieve high accuracy:
y_hat_rf <- predict(fit_rf, x_test[ ,col_index])
confusionMatrix(y_hat_rf, y_test)$overall["Accuracy"]
#> Accuracy
#> 0.954
By optimizing some of the other algorithm parameters we can achieve even higher accuracy.
30.5 Testing and improving computation time
The default method for estimating accuracy used by the train function is to test prediction on 25 bootstrap samples. This can result in long compute times. For example, if we are considering several values, say 10, of the tuning parameters, we will fit the algorithm 250 times. We can use the system.time function to estimate how long it takes to run the algorithm once:
system.time(fit_rf <- randomForest(x[, col_index], y, mtry = 9))
#> user system elapsed
#> 60.61 0.59 61.25
and use this to estimate the total time for the 250 iterations. In this case it will be several hours.
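A back-of-the-envelope calculation based on the timing above:
# roughly 61 seconds per fit times 250 fits, expressed in hours
61.25 * 250 / 60^2
which comes out to a bit over four hours.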
One way to reduce run time is to use k-fold cross validation with a smaller number of test sets. A popular choice is leaving out 10 test sets with 10% of the data:
control <- trainControl(method = "cv", number = 10, p = .9)
and re-running the train function with this choice specified via the trControl argument:
train_rf <- train(x[, col_index], y,
                  method = "rf",
                  tuneGrid = grid,
                  trControl = control)
For random forest we can also speed up the training step by running fewer trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows. Here we plot the fit obtained above:
plot(fit_rf)
We can see that the error rate stabilizes after about 200 trees. We can use this finding to speed up the cross validation procedure. Specifically, because the default is 500, by adding the argument ntree = 200 to the call to train above, the procedure will finish 2.5 times faster.
30.6 Variable importance
The following function computes the importance of each feature:
imp <- importance(fit_rf)
We can see which features are being used most by plotting an image:
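# A sketch: map the importance values back to the 28 x 28 pixel grid.
# Columns removed by nearZeroVar get an importance of zero.
imp_full <- rep(0, ncol(x))
imp_full[col_index] <- imp
image(matrix(imp_full, 28, 28))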
30.7 Visual assessments
An important part of data analysis is visualizing results to determine why we are failing. How we do this depends on the application. Below we show the images of digits for which we made an incorrect prediction. Here are some errors for the random forest:
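# One way to find and display the mistakes (a sketch; the column flip is only for orientation):
ind <- which(y_hat_rf != y_test)
for (i in ind[1:4]) {
  image(matrix(x_test[i, ], 28, 28)[, 28:1],
        main = paste("Predicted:", y_hat_rf[i], "Actual:", y_test[i]))
}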
By examining errors like this we often find specific weaknesses of algorithms or parameter choices, and we can try to correct them.
30.8 Ensembles
The idea of an ensemble is similar to the idea of combining data from different pollsters to obtain a better estimate of the true support for each candidate.
In machine learning, one can usually greatly improve the final results by combining the results of different algorithms.
Here is a simple example where we compute new class probabilities by taking the average of random forest and kNN. We can see that the accuracy improves to 0.96:
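# A sketch: average the estimated class probabilities from the random forest and kNN fits
p_rf <- predict(fit_rf, x_test[, col_index], type = "prob")
p_rf <- p_rf / rowSums(p_rf)
p_knn <- predict(fit_knn, x_test[, col_index])
p <- (p_rf + p_knn) / 2
y_pred <- factor(apply(p, 1, which.max) - 1, levels = levels(y_test))
confusionMatrix(y_pred, y_test)$overall["Accuracy"]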
In the exercises we are going to build several machine learning models for the mnist_27 dataset and then build an ensemble.
30.9 Exercises
1. Previously in the book, we have compared the conditional probability given two predictors \(p(x_1,x_2)\) to the fit \(\hat{p}(x_1,x_2)\) obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of \(x_1\) and \(x_2\) for which the function is \(0.5\).
plot_cond_prob <- function(x_1, x_2, p){
data.frame(x_1 = x_1, x_2 = x_2, p = p) |>
ggplot(aes(x_1, x_2)) +
geom_raster(aes(fill = p), show.legend = FALSE) +
stat_contour(aes(z = p), breaks = 0.5, color = "black") +
scale_fill_gradientn(colors = c("#F8766D", "white", "#00BFC4"))
}
We can see the true conditional probability for the 2 or 7 example like this:
with(mnist_27$true_p, plot_cond_prob(x_1, x_2, p))
Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument newdata = mnist_27$true_p to obtain predictions for a grid of points.
2. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the caret manual we see that we can use the gamLoess method. In the manual we also see that we need to install the gam package if we have not done so already. We see that we have two parameters to optimize:
modelLookup("gamLoess")
#> model parameter label forReg forClass probModel
#> 1 gamLoess span Span TRUE TRUE TRUE
#> 2 gamLoess degree Degree TRUE TRUE TRUE
Use cross-validation to pick a span between 0.15 and 0.75. Keep degree = 1. What span does cross-validation select?
3. Show an image plot of the estimate \(\hat{p}(x_1,x_2)\) resulting from the model fit in exercise 2. How does the accuracy compare to that of kNN? Comment on the differences with the estimate obtained with kNN.
4. Use the mnist_27 training set to build a model with several of the models available from the caret package. For example, you can try these:
models <- c("glm", "lda", "naive_bayes", "svmLinear", "gamboost",
"gamLoess", "qda", "knn", "kknn", "loclda", "gam", "rf",
"ranger","wsrf", "Rborist", "avNNet", "mlp", "monmlp", "gbm",
"adaboost", "svmRadial", "svmRadialCost", "svmRadialSigma")
We have not explained many of these, but apply them anyway using train with all the default parameters. Keep the results in a list. You might need to install some packages. Keep in mind that you will likely get some warnings.
5. Now that you have all the trained models in a list, use sapply or map to create a matrix of predictions for the test set. You should end up with a matrix with length(mnist_27$test$y) rows and length(models) columns.
6. Now compute accuracy for each model on the test set.
7. Now build an ensemble prediction by majority vote and compute its accuracy.
8. Earlier we computed the accuracy of each method on the test set and noticed they varied. Which individual methods do better than the ensemble?
9. It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem with this approach is that we are using the test data to make a decision. However, we could use the accuracy estimates obtained from cross validation with the training data. Obtain these estimates and save them in an object.
10. Now let’s only consider the methods with an estimated accuracy of 0.8 when constructing the ensemble. What is the accuracy now?
11. Advanced: If two methods give results that are the same, ensembling them will not change the results at all. For each pair of methods, compare the percent of time they make the same prediction. Then use the heatmap function to visualize the results. Hint: use the method = "binary" argument in the dist function.
12. Advanced: Note that each method can also produce an estimated conditional probability. Instead of majority vote we can take the average of these estimated conditional probabilities. For most methods, we can use type = "prob" in the predict function. However, some of the methods require you to use the argument trControl = trainControl(classProbs = TRUE) when calling train. Also, these methods do not work if classes have numbers as names. Hint: change the levels like this:
dat$train$y <- recode_factor(dat$train$y, "2"="two", "7"="seven")
dat$test$y <- recode_factor(dat$test$y, "2"="two", "7"="seven")
13. In this chapter, we illustrated a couple of machine learning algorithms on a subset of the MNIST dataset. Try fitting a model to the entire dataset.