30  Machine learning in practice

Now that we have learned several methods and explored them with simple, yet illustrative, examples, we will try them out on a real example: the MNIST digits.

We can load this data using the following dslabs package:

mnist <- read_mnist()

The dataset includes two components, a training set and test set:

#> [1] "train" "test"

Each of these components includes a matrix with features in the columns:

#> [1] 60000   784

and vector with the classes as integers:

#> [1] "integer"
#>    0    1    2    3    4    5    6    7    8    9 
#> 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949

Because we want this example to run on a small laptop and in less than one hour, we will consider a subset of the dataset. We will sample 10,000 random rows from the training set and 1,000 random rows from the test set:

index <- sample(nrow(mnist$train$images), 10000)
x <- mnist$train$images[index,]
y <- factor(mnist$train$labels[index])
index <- sample(nrow(mnist$test$images), 1000)
x_test <- mnist$test$images[index,]
y_test <- factor(mnist$test$labels[index])

30.1 The caret package

We have already learned about several machine learning algorithms. Many of these algorithms are implemented in R. However, they are distributed via different packages, developed by different authors, and often use different syntax. The caret package tries to consolidate these differences and provide consistency. It currently includes over 200 different methods which are summarized in the caret package manual1. Keep in mind that caret does not include the needed packages and, to implement a package through caret, you still need to install the library. The required packages for each method are described in the package manual. The caret package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this incredibly helpful package. We will use the 2 or 7 example to illustrate and in later sections we use use the package to run algorithms on the larger MNIST datset.

30.1.1 The train functon

The caret train function lets us train different algorithms using similar syntax. So, for example, we can type:

#> Loading required package: lattice
train_glm <- train(y ~ ., method = "glm", data = mnist_27$train)
train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)

To make predictions, we can use the output of this function directly without needing to look at the specifics of predict.glm and predict.knn. Instead, we can learn how to obtain predictions from predict.train.

The code looks the same for both methods:

y_hat_glm <- predict(train_glm, mnist_27$test, type = "raw")
y_hat_knn <- predict(train_knn, mnist_27$test, type = "raw")

This permits us to quickly compare the algorithms. For example, we can compare the accuracy like this:

fits <- list(glm = y_hat_glm, knn = y_hat_knn)
sapply(fits, function(fit) confusionMatrix(fit, mnist_27$test$y)$overall[["Accuracy"]])
#>  glm  knn 
#> 0.75 0.84

30.1.2 Cross validation

When an algorithm includes a tuning parameter, train automatically uses cross validation to decide among a few default values. To find out what parameter or parameters are optimized, you can read the manual 2 or study the output of:

We can also use a quick lookup like this:

If we run it with default values:

train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)

you can quickly see the results of the cross validation using the ggplot function. The argument highlight highlights the max:

ggplot(train_knn, highlight = TRUE)

By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. For the kNN method, the default is to try \(k=5,7,9\). We change this using the tuneGrid parameter. The grid of values must be supplied by a data frame with the parameter names as specified in the modelLookup output.

Here, we present an example where we try out 30 values between 9 and 67. To do this with caret, we need to define a column named k, so we use this: data.frame(k = seq(9, 67, 2)). Note that when running this code, we are fitting 30 versions of kNN to 25 bootstrapped samples. Since we are fitting \(30 \times 25 = 750\) kNN models, running this code will take several seconds. We set the seed because cross validation is a random procedure and we want to make sure the result here is reproducible.

train_knn <- train(y ~ ., method = "knn", 
                   data = mnist_27$train,
                   tuneGrid = data.frame(k = seq(9, 71, 2)))
ggplot(train_knn, highlight = TRUE)

To access the parameter that maximized the accuracy, you can use this:

#>     k
#> 10 27

and the best performing model like this:

#> 27-nearest neighbor model
#> Training set outcome distribution:
#>   2   7 
#> 379 421

The function predict will use this best performing model. Here is the accuracy of the best model when applied to the test set, which we have not used at all yet because the cross validation was done on the training set:

confusionMatrix(predict(train_knn, mnist_27$test, type = "raw"),
#> Accuracy 
#>    0.835

If we want to change how we perform cross validation, we can use the trainControl function. We can make the code above go a bit faster by using, for example, 10-fold cross validation. This means we have 10 samples using 10% of the observations each. We accomplish this using the following code:

control <- trainControl(method = "cv", number = 10, p = .9)
train_knn_cv <- train(y ~ ., method = "knn", 
                   data = mnist_27$train,
                   tuneGrid = data.frame(k = seq(9, 71, 2)),
                   trControl = control)
ggplot(train_knn_cv, highlight = TRUE)

We notice that the accuracy estimates are more variable, which is expected since we changed the number of samples used to estimate accuracy.

Note that results component of the train output includes several summary statistics related to the variability of the cross validation estimates:

#> [1] "k"          "Accuracy"   "Kappa"      "AccuracySD" "KappaSD"

We have only covered the basics. To caret package manual 3 includes many more details.

30.2 Preprocessing

In machine learning, we often transform predictors before running the machine algorithm. We also remove predictors that are clearly not useful. We call these steps preprocessing.

Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few non-unique values or close to zero variation. We show an example below.

We can run the nearZero function from the caret package to see that several features do not vary much from observation to observation. We can see that there is a large number of features with 0 variability:

sds <- colSds(x)
hist(sds, breaks = 256)

This is expected because there are parts of the image that rarely contain writing (dark pixels).

The caret packages includes a function that recommends features to be removed due to near zero variance:

nzv <- nearZeroVar(x)

We can see the columns recommended for removal:

image(matrix(1:784 %in% nzv, 28, 28))
image(matrix(1:784 %in% nzv, 28, 28))

So we end up keeping this number of columns:

col_index <- setdiff(1:ncol(x), nzv)
#> [1] 252

Now we are ready to fit some models. Before we start, we need to add column names to the feature matrices as these are required by caret:

colnames(x) <- 1:ncol(mnist$train$images)
colnames(x_test) <- colnames(x)

30.3 k-nearest neighbors

Before starting this section, be warned that the first two calls to the train function in the code below can take several hours to run. This is common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameter being considered. In the next section, we will provide some suggestion on how to predict the duration of the process and ways to reduce.

The first step is to optimize for \(k\).

train_knn <- train(x[ ,col_index], y, 
                   method = "knn", 
                   tuneGrid = data.frame(k = seq(3, 13, 2)))

Once we optimize our algorithm, we can fit it to the entire dataset:

fit_knn <- knn3(x[, col_index], y,  k = train_knn$bestTune)

We achieve a high accuracy:

y_hat_knn <- predict(fit_knn, x_test[, col_index], type = "class")
confusionMatrix(y_hat_knn, factor(y_test))$overall["Accuracy"]
#> Accuracy 
#>    0.945

An alternative to removing low variance columns directly, is to use dimension reduction on the feature matrix before applying the algorithms. It is important that we not use the test set when finding the PCs nor any summary of the data, as this could result in overtraining. So we start by applying prcomp to the training data:

col_means <- colMeans(x)
pca <- prcomp(sweep(x, 2, col_means), center = FALSE)

Next, and run knn on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the data.

k <- 36
x_train <- pca$x[,1:k]
train_knn <- train(x_train, y, 
                   method = "knn", 
                   tuneGrid = data.frame(k = seq(3, 13, 2)))
fit <- knn3(x_train, y, k = train_knn$bestTune)

Now we apply the transformation we learned with the training data to the test data, reduce the dimension, and then run predict. Note that we used the rotation and column means estimated from the training data.

y_hat <- predict(fit, sweep(x_test, 2, col_means) %*% pca$rotation[,1:k], type = "class")
confusionMatrix(y_hat, factor(y_test))$overall["Accuracy"]
#> Accuracy 
#>    0.962

With obtain an improvement in accuracy, while using only 36 dimensions.

30.4 Random Forest

With the random forest algorithm several parameters can be optimized, but the main one is mtry, the number of predictors that are randomly selected for each tree. This is also the only tuning parameter that the caret function train permits when using the default implementation from the randomForest package.

grid <- data.frame(mtry = seq(3, 24, 3))
train_rf <-  train(x[, col_index], y, 
                   method = "rf", 
                   tuneGrid = grid)

Now that we have optimized our algorithm, we are ready to fit our final model:

fit_rf <- randomForest(x[, col_index], y, mtry = train_rf$bestTune$mtry)
#> randomForest 4.7-1.1
#> Type rfNews() to see new features/changes/bug fixes.
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:ggplot2':
#>     margin

As with kNN, we also achieve high accuracy:

y_hat_rf <- predict(fit_rf, x_test[ ,col_index])
confusionMatrix(y_hat_rf, y_test)$overall["Accuracy"]
#> Accuracy 
#>    0.954

By optimizing some of the other algorithm parameters we can achieve even higher accuracy.

30.5 Testing and improving computation time

The default method for estimating accuracy used by the train function is to test prediction on 25 bootstrap samples. This can result in long compute times. For examples, if we are considering several values, say 10, of the tuning parameters, we will fit the algorithm 250 times. We can use the system.time function to estimate the how long it takes to run the algorithm once

system.time(fit_rf <- randomForest(x[, col_index], y,  mtry = 9))
#>    user  system elapsed 
#>   60.61    0.59   61.25

and use this to estimate the total time for the 250 iterations. In this case it will be several hours.

One way to reduce run time is to use k-fold cross validation with a smaller number of test sets. A popular choice is leaving out 10 test sets with 10% of the data:

control <- trainControl(method = "cv", number = 10, p = .9)

and re-running the train function with this choice specified via the trControl argument:

train_rf <-  train(x[, col_index], y, 
                   method = "rf", 
                   tuneGrid = grid,
                   trControl = control)

For random forest we can also speed up the training step by running less trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows. Here we


We can see that error rate stabilizes after about 200 trees. We can use this finding to speed up the cross validation procedure. Specifically, because the default is 500, by adding the argument ntree = 200 to the call to train above, the procedure will finish 2.5 times faster.

30.6 Variable importance

The following function computes the importance of each feature:

imp <- importance(fit_rf)

We can see which features are being used most by plotting an image:

mat <- rep(0, ncol(x))
mat[col_index] <- imp
image(matrix(mat, 28, 28))
mat <- rep(0, ncol(x))
mat[col_index] <- imp
image(matrix(mat, 28, 28))

30.7 Visual assessments

An important part of data analysis is visualizing results to determine why we are failing. How we do this depends on the application. Below we show the images of digits for which we made an incorrect prediction. Here are some errors for the random forest:

By examining errors like this we often find specific weaknesses to algorithms or parameter choices and can try to correct them.

30.8 Ensembles

The idea of an ensemble is similar to the idea of combining data from different pollsters to obtain a better estimate of the true support for each candidate.

In machine learning, one can usually greatly improve the final results by combining the results of different algorithms.

Here is a simple example where we compute new class probabilities by taking the average of random forest and kNN. We can see that the accuracy improves to 0.96:

p_rf <- predict(fit_rf, x_test[,col_index], type = "prob")  
p_rf <- p_rf / rowSums(p_rf)
p_knn  <- predict(fit_knn, x_test[,col_index])
p <- (p_rf + p_knn)/2
y_pred <- factor(apply(p, 1, which.max) - 1)
confusionMatrix(y_pred, y_test)$overall["Accuracy"]
#> Accuracy 
#>    0.954

In the exercises we are going to build several machine learning models for the mnist_27 dataset and then build an ensemble.

30.9 Exercises

1. Previously in the book, we have compared conditional probability give two predictors \(p(x_1,x_2)\) to the fit \(\hat{p}(x_1,x_2)\) obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of \(x_1\) and \(x_2\) for which the function is \(0.5\).

plot_cond_prob <- function(x_1, x_2, p){
  data.frame(x_1 = x_1, x_2 = x_2, p = p) |>
    ggplot(aes(x_1, x_2)) +
    geom_raster(aes(fill = p), show.legend = FALSE) +
    stat_contour(aes(z = p), breaks = 0.5, color = "black") +
    scale_fill_gradientn(colors = c("#F8766D", "white", "#00BFC4"))

We can see the true conditional probability for the 2 or 7 example like this:

with(mnist_27$true_p, plot_cond_prob(x_1, x_2, p))

Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument newdata=mnist27$train to obtain predictions for a grid points.

2. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the manual4 we see that we can use the gamLoess method. In the manual5 we also see that we need to install the gam package if we have not done so already. We see that we have two parameters to optimize:

#>      model parameter  label forReg forClass probModel
#> 1 gamLoess      span   Span   TRUE     TRUE      TRUE
#> 2 gamLoess    degree Degree   TRUE     TRUE      TRUE

Use cross-validation to pick a span between 0.15 and 0.75. Keep degree = 1. What span does cross-validation select?

3. Show an image plot of the estimate \(\hat{p}(x,y)\) resulting from the model fit in the exercise 2. How does the accuracy compare to that of kNN? Comment on the difference between the estimate obtained with kNN.

4. Use the mnist_27 training set to build a model with several of the models available from the caret package. For example, you can try these:

models <- c("glm", "lda",  "naive_bayes",  "svmLinear", "gamboost",  
            "gamLoess", "qda", "knn", "kknn", "loclda", "gam", "rf", 
            "ranger","wsrf", "Rborist", "avNNet", "mlp", "monmlp", "gbm", 
            "adaboost", "svmRadial", "svmRadialCost", "svmRadialSigma")

We have not explained many of these, but apply them anyway using train with all the default parameters. Keep the results in a list. You might need to install some packages. Keep in mind that you will likely get some warnings.

5. Now that you have all the trained models in a list, use sapply or map to create a matrix of predictions for the test set. You should end up with a matrix with length(mnist_27$test$y) rows and length(models) columns.

6. Now compute accuracy for each model on the test set.

7. Now build an ensemble prediction by majority vote and compute its accuracy.

8. Earlier we computed the accuracy of each method on the training set and noticed they varied. Which individual methods do better than the ensemble?

9. It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem with this approach is that we are using the test data to make a decision. However, we could use the accuracy estimates obtained from cross validation with the training data. Obtain these estimates and save them in an object.

10. Now let’s only consider the methods with an estimated accuracy of 0.8 when constructing the ensemble. What is the accuracy now?

11. Advanced: If two methods give results that are the same, ensembling them will not change the results at all. For each pair of metrics compare the percent of time they call the same thing. Then use the heatmap function to visualize the results. Hint: use the method = "binary" argument in the dist function.

12. Advanced: Note that each method can also produce an estimated conditional probability. Instead of majority vote we can take the average of these estimated conditional probabilities. For most methods, we can the use the type = "prob" in the train function. However, some of the methods require you to use the argument trControl=trainControl(classProbs=TRUE) when calling train. Also these methods do not work if classes have numbers as names. Hint: change the levels like this:

dat$train$y <- recode_factor(dat$train$y, "2"="two", "7"="seven")
dat$test$y <- recode_factor(dat$test$y, "2"="two", "7"="seven")

13. In this chapter, we illustrated a couple of machine learning algorithms on a subset of the MNIST dataset. Try fitting a model to the entire dataset.

  1. https://topepo.github.io/caret/available-models.html↩︎

  2. http://topepo.github.io/caret/available-models.html↩︎

  3. https://topepo.github.io/caret/available-models.html↩︎

  4. https://topepo.github.io/caret/available-models.html↩︎

  5. https://topepo.github.io/caret/train-models-by-tag.html↩︎