```
library(dslabs)
mnist <- read_mnist()
```

# 30 Machine learning in practice

Now that we have learned several methods and explored them with simple, yet illustrative, examples, we will try them out on a real example: the MNIST digits.

We can load this data using the following **dslabs** package:

The dataset includes two components, a training set and test set:

```
names(mnist)
#> [1] "train" "test"
```

Each of these components includes a matrix with features in the columns:

```
dim(mnist$train$images)
#> [1] 60000 784
```

and vector with the classes as integers:

Because we want this example to run on a small laptop and in less than one hour, we will consider a subset of the dataset. We will sample 10,000 random rows from the training set and 1,000 random rows from the test set:

## 30.1 The caret package

We have already learned about several machine learning algorithms. Many of these algorithms are implemented in R. However, they are distributed via different packages, developed by different authors, and often use different syntax. The **caret** package tries to consolidate these differences and provide consistency. It currently includes over 200 different methods which are summarized in the **caret** package manual^{1}. Keep in mind that **caret** does not include the needed packages and, to implement a package through **caret**, you still need to install the library. The required packages for each method are described in the package manual. The **caret** package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this incredibly helpful package. We will use the 2 or 7 example to illustrate and in later sections we use use the package to run algorithms on the larger MNIST datset.

###
30.1.1 The `train`

functon

The **caret** `train`

function lets us train different algorithms using similar syntax. So, for example, we can type:

To make predictions, we can use the output of this function directly without needing to look at the specifics of `predict.glm`

and `predict.knn`

. Instead, we can learn how to obtain predictions from `predict.train`

.

The code looks the same for both methods:

This permits us to quickly compare the algorithms. For example, we can compare the accuracy like this:

```
fits <- list(glm = y_hat_glm, knn = y_hat_knn)
sapply(fits, function(fit) confusionMatrix(fit, mnist_27$test$y)$overall[["Accuracy"]])
#> glm knn
#> 0.75 0.84
```

### 30.1.2 Cross validation

When an algorithm includes a tuning parameter, `train`

automatically uses cross validation to decide among a few default values. To find out what parameter or parameters are optimized, you can read the manual ^{2} or study the output of:

`getModelInfo("knn")`

We can also use a quick lookup like this:

`modelLookup("knn")`

If we run it with default values:

`train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)`

you can quickly see the results of the cross validation using the `ggplot`

function. The argument `highlight`

highlights the max:

`ggplot(train_knn, highlight = TRUE)`

By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. For the `kNN`

method, the default is to try \(k=5,7,9\). We change this using the `tuneGrid`

parameter. The grid of values must be supplied by a data frame with the parameter names as specified in the `modelLookup`

output.

Here, we present an example where we try out 30 values between 9 and 67. To do this with **caret**, we need to define a column named `k`

, so we use this: `data.frame(k = seq(9, 67, 2))`

. Note that when running this code, we are fitting 30 versions of kNN to 25 bootstrapped samples. Since we are fitting \(30 \times 25 = 750\) kNN models, running this code will take several seconds. We set the seed because cross validation is a random procedure and we want to make sure the result here is reproducible.

```
set.seed(2008)
train_knn <- train(y ~ ., method = "knn",
data = mnist_27$train,
tuneGrid = data.frame(k = seq(9, 71, 2)))
ggplot(train_knn, highlight = TRUE)
```

To access the parameter that maximized the accuracy, you can use this:

```
train_knn$bestTune
#> k
#> 10 27
```

and the best performing model like this:

```
train_knn$finalModel
#> 27-nearest neighbor model
#> Training set outcome distribution:
#>
#> 2 7
#> 379 421
```

The function `predict`

will use this best performing model. Here is the accuracy of the best model when applied to the test set, which we have not used at all yet because the cross validation was done on the training set:

```
confusionMatrix(predict(train_knn, mnist_27$test, type = "raw"),
mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#> 0.835
```

If we want to change how we perform cross validation, we can use the `trainControl`

function. We can make the code above go a bit faster by using, for example, 10-fold cross validation. This means we have 10 samples using 10% of the observations each. We accomplish this using the following code:

```
control <- trainControl(method = "cv", number = 10, p = .9)
train_knn_cv <- train(y ~ ., method = "knn",
data = mnist_27$train,
tuneGrid = data.frame(k = seq(9, 71, 2)),
trControl = control)
ggplot(train_knn_cv, highlight = TRUE)
```

We notice that the accuracy estimates are more variable, which is expected since we changed the number of samples used to estimate accuracy.

Note that `results`

component of the `train`

output includes several summary statistics related to the variability of the cross validation estimates:

```
names(train_knn$results)
#> [1] "k" "Accuracy" "Kappa" "AccuracySD" "KappaSD"
```

We have only covered the basics. To **caret** package manual ^{3} includes many more details.

## 30.2 Preprocessing

In machine learning, we often transform predictors before running the machine algorithm. We also remove predictors that are clearly not useful. We call these steps *preprocessing*.

Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few non-unique values or close to zero variation. We show an example below.

We can run the `nearZero`

function from the **caret** package to see that several features do not vary much from observation to observation. We can see that there is a large number of features with 0 variability:

```
library(matrixStats)
sds <- colSds(x)
hist(sds, breaks = 256)
```

This is expected because there are parts of the image that rarely contain writing (dark pixels).

The **caret** packages includes a function that recommends features to be removed due to *near zero variance*:

`nzv <- nearZeroVar(x)`

We can see the columns recommended for removal:

So we end up keeping this number of columns:

Now we are ready to fit some models. Before we start, we need to add column names to the feature matrices as these are required by **caret**:

## 30.3 k-nearest neighbors

Before starting this section, **be warned** that the first two calls to the `train`

function in the code below can take several hours to run. This is common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameter being considered. In the next section, we will provide some suggestion on how to predict the duration of the process and ways to reduce.

The first step is to optimize for \(k\).

```
train_knn <- train(x[ ,col_index], y,
method = "knn",
tuneGrid = data.frame(k = seq(3, 13, 2)))
```

Once we optimize our algorithm, we can fit it to the entire dataset:

`fit_knn <- knn3(x[, col_index], y, k = train_knn$bestTune)`

We achieve a high accuracy:

```
y_hat_knn <- predict(fit_knn, x_test[, col_index], type = "class")
confusionMatrix(y_hat_knn, factor(y_test))$overall["Accuracy"]
#> Accuracy
#> 0.945
```

An alternative to removing low variance columns directly, is to use dimension reduction on the feature matrix before applying the algorithms. It is important that we not use the test set when finding the PCs nor any summary of the data, as this could result in overtraining. So we start by applying `prcomp`

to the training data:

Next, and run knn on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the data.

```
k <- 36
x_train <- pca$x[,1:k]
train_knn <- train(x_train, y,
method = "knn",
tuneGrid = data.frame(k = seq(3, 13, 2)))
fit <- knn3(x_train, y, k = train_knn$bestTune)
```

Now we apply the transformation we learned with the training data to the test data, reduce the dimension, and then run predict. Note that we used the rotation and column means estimated from the training data.

```
y_hat <- predict(fit, sweep(x_test, 2, col_means) %*% pca$rotation[,1:k], type = "class")
confusionMatrix(y_hat, factor(y_test))$overall["Accuracy"]
#> Accuracy
#> 0.962
```

With obtain an improvement in accuracy, while using only 36 dimensions.

## 30.4 Random Forest

With the random forest algorithm several parameters can be optimized, but the main one is `mtry`

, the number of predictors that are randomly selected for each tree. This is also the only tuning parameter that the **caret** function `train`

permits when using the default implementation from the **randomForest** package.

```
library(randomForest)
grid <- data.frame(mtry = seq(3, 24, 3))
train_rf <- train(x[, col_index], y,
method = "rf",
tuneGrid = grid)
```

Now that we have optimized our algorithm, we are ready to fit our final model:

`fit_rf <- randomForest(x[, col_index], y, mtry = train_rf$bestTune$mtry)`

```
#> randomForest 4.7-1.1
#> Type rfNews() to see new features/changes/bug fixes.
#>
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:ggplot2':
#>
#> margin
```

As with kNN, we also achieve high accuracy:

```
y_hat_rf <- predict(fit_rf, x_test[ ,col_index])
confusionMatrix(y_hat_rf, y_test)$overall["Accuracy"]
#> Accuracy
#> 0.954
```

By optimizing some of the other algorithm parameters we can achieve even higher accuracy.

## 30.5 Testing and improving computation time

The default method for estimating accuracy used by the `train`

function is to test prediction on 25 bootstrap samples. This can result in long compute times. For examples, if we are considering several values, say 10, of the tuning parameters, we will fit the algorithm 250 times. We can use the `system.time`

function to estimate the how long it takes to run the algorithm once

```
system.time(fit_rf <- randomForest(x[, col_index], y, mtry = 9))
#> user system elapsed
#> 60.61 0.59 61.25
```

and use this to estimate the total time for the 250 iterations. In this case it will be several hours.

One way to reduce run time is to use k-fold cross validation with a smaller number of test sets. A popular choice is leaving out 10 test sets with 10% of the data:

`control <- trainControl(method = "cv", number = 10, p = .9)`

and re-running the `train`

function with this choice specified via the `trControl`

argument:

```
train_rf <- train(x[, col_index], y,
method = "rf",
tuneGrid = grid,
trControl = control)
```

For random forest we can also speed up the training step by running less trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows. Here we

`plot(fit_rf)`

We can see that error rate stabilizes after about 200 trees. We can use this finding to speed up the cross validation procedure. Specifically, because the default is 500, by adding the argument `ntree = 200`

to the call to `train`

above, the procedure will finish 2.5 times faster.

## 30.6 Variable importance

The following function computes the importance of each feature:

`imp <- importance(fit_rf)`

We can see which features are being used most by plotting an image:

## 30.7 Visual assessments

An important part of data analysis is visualizing results to determine why we are failing. How we do this depends on the application. Below we show the images of digits for which we made an incorrect prediction. Here are some errors for the random forest:

By examining errors like this we often find specific weaknesses to algorithms or parameter choices and can try to correct them.

## 30.8 Ensembles

The idea of an ensemble is similar to the idea of combining data from different pollsters to obtain a better estimate of the true support for each candidate.

In machine learning, one can usually greatly improve the final results by combining the results of different algorithms.

Here is a simple example where we compute new class probabilities by taking the average of random forest and kNN. We can see that the accuracy improves to 0.96:

In the exercises we are going to build several machine learning models for the `mnist_27`

dataset and then build an ensemble.

## 30.9 Exercises

1. Previously in the book, we have compared conditional probability give two predictors \(p(x_1,x_2)\) to the fit \(\hat{p}(x_1,x_2)\) obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of \(x_1\) and \(x_2\) for which the function is \(0.5\).

```
plot_cond_prob <- function(x_1, x_2, p){
data.frame(x_1 = x_1, x_2 = x_2, p = p) |>
ggplot(aes(x_1, x_2)) +
geom_raster(aes(fill = p), show.legend = FALSE) +
stat_contour(aes(z = p), breaks = 0.5, color = "black") +
scale_fill_gradientn(colors = c("#F8766D", "white", "#00BFC4"))
}
```

We can see the true conditional probability for the 2 or 7 example like this:

`with(mnist_27$true_p, plot_cond_prob(x_1, x_2, p))`

Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument `newdata=mnist27$train`

to obtain predictions for a grid points.

2. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the manual^{4} we see that we can use the `gamLoess`

method. In the manual^{5} we also see that we need to install the **gam** package if we have not done so already. We see that we have two parameters to optimize:

```
modelLookup("gamLoess")
#> model parameter label forReg forClass probModel
#> 1 gamLoess span Span TRUE TRUE TRUE
#> 2 gamLoess degree Degree TRUE TRUE TRUE
```

Use cross-validation to pick a span between 0.15 and 0.75. Keep `degree = 1`

. What span does cross-validation select?

3. Show an image plot of the estimate \(\hat{p}(x,y)\) resulting from the model fit in the exercise 2. How does the accuracy compare to that of kNN? Comment on the difference between the estimate obtained with kNN.

4. Use the `mnist_27`

training set to build a model with several of the models available from the **caret** package. For example, you can try these:

```
models <- c("glm", "lda", "naive_bayes", "svmLinear", "gamboost",
"gamLoess", "qda", "knn", "kknn", "loclda", "gam", "rf",
"ranger","wsrf", "Rborist", "avNNet", "mlp", "monmlp", "gbm",
"adaboost", "svmRadial", "svmRadialCost", "svmRadialSigma")
```

We have not explained many of these, but apply them anyway using `train`

with all the default parameters. Keep the results in a list. You might need to install some packages. Keep in mind that you will likely get some warnings.

5. Now that you have all the trained models in a list, use `sapply`

or `map`

to create a matrix of predictions for the test set. You should end up with a matrix with `length(mnist_27$test$y)`

rows and `length(models)`

columns.

6. Now compute accuracy for each model on the test set.

7. Now build an ensemble prediction by majority vote and compute its accuracy.

8. Earlier we computed the accuracy of each method on the training set and noticed they varied. Which individual methods do better than the ensemble?

9. It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem with this approach is that we are using the test data to make a decision. However, we could use the accuracy estimates obtained from cross validation with the training data. Obtain these estimates and save them in an object.

10. Now let’s only consider the methods with an estimated accuracy of 0.8 when constructing the ensemble. What is the accuracy now?

11. **Advanced**: If two methods give results that are the same, ensembling them will not change the results at all. For each pair of metrics compare the percent of time they call the same thing. Then use the `heatmap`

function to visualize the results. Hint: use the `method = "binary"`

argument in the `dist`

function.

12. **Advanced**: Note that each method can also produce an estimated conditional probability. Instead of majority vote we can take the average of these estimated conditional probabilities. For most methods, we can the use the `type = "prob"`

in the train function. However, some of the methods require you to use the argument `trControl=trainControl(classProbs=TRUE)`

when calling train. Also these methods do not work if classes have numbers as names. Hint: change the levels like this:

```
dat$train$y <- recode_factor(dat$train$y, "2"="two", "7"="seven")
dat$test$y <- recode_factor(dat$test$y, "2"="two", "7"="seven")
```

13. In this chapter, we illustrated a couple of machine learning algorithms on a subset of the MNIST dataset. Try fitting a model to the entire dataset.