26  Notation and Terminology

In Section 21.2, we introduced the MNIST handwritten digits dataset. We now return to this dataset and use the task of recognizing handwritten digits as our running example for the next few chapters. Our goal in this chapter is not yet to develop a full algorithm, but to introduce the machine learning notation and terminology that we will rely on throughout this part of the book.

Handwritten digit recognition is a classic machine learning problem. Decades ago, mail sorting in the post office required humans to read every zip code written on envelopes. Today, machines perform this task automatically: a computer reads the digitized image of the zip code, and a robot sorts the letter accordingly.

In the pages that follow, we will describe how to formalize this problem mathematically, define the objects we need to work with, and introduce the language used to describe inputs, outputs, features, predictions, and learning algorithms. This will set the foundation for the machine learning methods developed in later chapters.

26.1 Terminology

In machine learning, data comes in the form of the outcome we want to predict and the features that we will use to predict the outcome. We build algorithms that take feature values as input and return a prediction when the outcome is unknown. The machine learning approach is to train an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don't know the outcome.

Prediction problems can be divided into categorical and continuous outcomes. Categorical outcomes can be any one of \(K\) classes. The number of classes can vary greatly across applications. For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In speech recognition, the outcomes are all possible words or phrases we are trying to detect. Spam detection has two outcomes: spam or not spam. In this book, we denote the \(K\) categories with indexes \(k=1,\dots,K\). However, for binary data we will use \(k=0,1\) for mathematical conveniences that we demonstrate later.
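To make this concrete, here is a minimal sketch, not taken from the text, of how these two encodings might look in code (the variable names are hypothetical): the \(K=10\) digit classes indexed directly by the digits, and a binary outcome coded as \(k=0,1\).

```python
# Hypothetical encoding of categorical outcomes as class indexes.
digit_classes = list(range(10))            # K = 10 classes: the digits 0 through 9
spam_classes = {"not spam": 0, "spam": 1}  # binary outcome coded as k = 0, 1

y_digit = 7                                # the outcome for one digit image
y_spam = spam_classes["spam"]              # the outcome for one email (1 = spam)
```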

26.2 Notation

We will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote the features. These features are also sometimes referred to as predictors or covariates, and we will treat these terms as synonyms.

The first step in building a machine learning algorithm is to clearly identify what the outcomes and features are. In Section 21.2, we showed that each digitized image \(i\) is associated with a categorical outcome \(Y_i\) and a set of features \(X_{i1}, \dots, X_{ip}\), with \(p=784\). For convenience, we often use boldface notation \(\mathbf{X}_i = (X_{i1}, \dots, X_{ip})^\top\) to represent the vector of predictors, following the notation introduced in Section 21.1.

When referring to an arbitrary set of features rather than a specific image, we drop the index \(i\) and simply use \(Y\) and \(\mathbf{X} = (X_1, \dots, X_p)\). We use uppercase to emphasize that these are random variables. Observed values are denoted in lowercase, such as \(Y=y\) and \(\mathbf{X} = \mathbf{x}\). In practice, when writing code, we typically use lowercase.
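As a small illustration of the lowercase convention in code, here is a sketch of how the observed outcomes and features for \(n\) digit images might be stored, assuming NumPy and randomly generated stand-in values rather than the actual MNIST images:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 784                  # n images, p = 28 x 28 pixel features
x = rng.random((n, p))            # row i holds the observed features x_{i,1}, ..., x_{i,p}
y = rng.integers(0, 10, size=n)   # observed outcomes: digit labels 0 through 9

x_1 = x[0]                        # the feature vector for the first observation
y_1 = y[0]                        # its corresponding outcome
```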

To denote predictions, we use hats, just as we do for parameter estimates. For a specific observation \(i\) with predictors \(\mathbf{x}_i\), the prediction is written as \(\hat{y}_i\). For an arbitrary predictor vector \(\mathbf{x}\), we write the prediction as \(\hat{y}(\mathbf{x})\), emphasizing that the prediction is a function of the predictors.

26.3 The machine learning challenge

The machine learning task is to build an algorithm that predicts the outcome given any combination of features. At first glance, this might seem impossible, but we will start with very simple examples and gradually build toward more complex cases. We begin with one predictor, then extend to two predictors, and eventually tackle real-world challenges involving thousands of predictors.

The general setup is as follows. We have a series of features and an unknown outcome we want to predict:

| outcome | feature 1 | feature 2 | feature 3 | \(\dots\) | feature p |
|---------|-----------|-----------|-----------|-----------|-----------|
| ?       | \(X_1\)   | \(X_2\)   | \(X_3\)   | \(\dots\) | \(X_p\)   |

To build a model that predicts outcomes from observed features \(X_1=x_1, X_2=x_2, \dots, X_p=x_p\), we need a dataset where the outcomes are known:

| outcome | feature 1 | feature 2 | feature 3 | \(\dots\) | feature p |
|---------|-----------|-----------|-----------|-----------|-----------|
| \(y_1\) | \(x_{1,1}\) | \(x_{1,2}\) | \(x_{1,3}\) | \(\dots\) | \(x_{1,p}\) |
| \(y_2\) | \(x_{2,1}\) | \(x_{2,2}\) | \(x_{2,3}\) | \(\dots\) | \(x_{2,p}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| \(y_n\) | \(x_{n,1}\) | \(x_{n,2}\) | \(x_{n,3}\) | \(\dots\) | \(x_{n,p}\) |
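This layout maps naturally to a data frame with the known outcome in one column and the \(p\) features in the remaining columns. Here is a minimal sketch using pandas and made-up values (the actual training data would come from the MNIST images):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, p = 5, 784
# One row per observation: the known outcome y, then the features x_1, ..., x_p.
train = pd.DataFrame(rng.random((n, p)), columns=[f"x_{j}" for j in range(1, p + 1)])
train.insert(0, "y", rng.integers(0, 10, size=n))
```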

When the outcome is continuous, we refer to the task as prediction. The model returns a function \(f\) that produces a prediction \(\hat{y}(\mathbf{x}) = f(x_1, x_2, \dots, x_p)\) for any feature vector. We call \(y\) the actual outcome. Predictions \(\hat{y}(\mathbf{x})\) are rarely exact, so we measure accuracy by the error, defined as \(y - \hat{y}(\mathbf{x})\).
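For instance, here is a minimal sketch of computing predictions and errors for a continuous outcome, using a made-up prediction function \(f\) and simulated data (none of this comes from the text):

```python
import numpy as np

def f(x):
    # Hypothetical prediction function: predict the outcome as the sum of the features.
    return float(np.sum(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                         # 100 observations, p = 3 features
y = x.sum(axis=1) + rng.normal(scale=0.1, size=100)   # simulated continuous outcomes

y_hat = np.array([f(row) for row in x])   # predictions \hat{y}(x_i) for each observation
errors = y - y_hat                        # the error: actual outcome minus prediction
```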

When the outcome is categorical, the task is called classification. The model produces a decision rule that prescribes which of the \(K\) possible classes should be predicted.

Typically, a classification model outputs one score per class, denoted \(f_k(x_1, \dots, x_p)\) for class \(k\). To make a prediction at a predictor vector \(\mathbf{x} = (x_1,\dots,x_p)\), we select the class with the largest score:

\[ \hat{y}(\mathbf{x}) = \arg\max_{k \in \{1,\dots,K\}} f_k(\mathbf{x}). \]

In the binary case (\(K=2\)), this is often written using a cutoff: if \(f_1(x_1,\dots,x_p) > c\) we predict class 1, otherwise we predict class 2. Here, predictions are simply correct or incorrect when compared to the true class.
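Below is a sketch of both decision rules, assuming we already have the class scores \(f_k(\mathbf{x})\) available as a NumPy array (the scores used in the example are hypothetical):

```python
import numpy as np

def classify(scores):
    """General rule: predict the class k with the largest score f_k(x).

    `scores` is a length-K array holding f_1(x), ..., f_K(x).
    """
    return int(np.argmax(scores)) + 1       # +1 so classes are labeled 1, ..., K

def classify_binary(f1_value, c=0.5):
    """Binary rule: predict class 1 if f_1(x) > c, otherwise class 2."""
    return 1 if f1_value > c else 2

print(classify(np.array([0.1, 0.7, 0.2])))  # K = 3 scores; prints 2
print(classify_binary(0.8, c=0.5))          # prints 1
```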

It is worth noting that terminology varies across textbooks and courses. Sometimes prediction is used for both categorical and continuous outcomes. The term regression is also used for continuous outcomes, but here we avoid it to prevent confusion with linear regression. In most contexts, whether outcomes are categorical or continuous will be clear, so we will simply use prediction or classification as appropriate.

26.4 Statistical vs. Machine Learning language

Up to this point in the book, we have used terminology that comes from statistics: models, estimators, and parameters. As we transition to machine learning, we encounter a new vocabulary. Some of these new terms overlap with familiar statistical ideas, while others reflect differences in emphasis between the two fields. In this section, we clarify the terminology we will use throughout the machine learning chapters.

In statistics, a model refers to a mathematical description of how the data are generated, such as a linear regression model with coefficients that represent the relationship between predictors and the outcome. In machine learning, the word model is used similarly, but with a slightly more practical emphasis: a model is the fitted object that we use to make predictions. For instance, when we fit a logistic regression to predict a binary outcome, the resulting estimated regression function is what a machine learner would call the model.

A key difference arises with the word algorithm. In statistics, an algorithm is simply a computational routine used to compute an estimate. In machine learning, however, algorithm is often used more broadly. It can refer to the computational procedure used to fit a model, for example, the iterative steps used to estimate the coefficients in logistic regression, but it also commonly refers to the overall prediction strategy itself. Thus, in machine learning practice, people talk about “the logistic regression algorithm” even though in statistics, logistic regression is thought of as a model rather than an algorithm. This dual use of the term is standard in the field, and we follow that convention here.

Finally, machine learning literature often uses method, approach, or procedure as general terms for any systematic way of mapping predictors to predictions. In earlier chapters, we introduced regression and logistic regression as statistical models. From the machine learning point of view, these are also prediction methods, functions we learn from data that can be applied to new cases.

Another important pair of terms is supervised versus unsupervised learning. In supervised learning, the goal is as described above: to predict an outcome we can observe in the training data. Regression and logistic regression fall into this category, and most of the machine learning techniques we discuss in the next chapters will also be supervised methods. In unsupervised learning, by contrast, there is no outcome variable; the goal is to uncover structure in the predictors themselves. Clustering is the most common example of this. We will study supervised learning first because it connects most directly to prediction, our primary goal, before briefly describing unsupervised learning in the last chapter.

26.5 Exercises

1. You are given a dataset containing pairs \((x_i, y_i)\), where \(x_i\) is a 2-dimensional feature vector describing the width and height of a handwritten digit image, and \(y_i\) indicates whether the digit is a “0”.

Write the following objects using the notation introduced in this chapter:

  • The set of observed features.
  • The set of corresponding outcomes.
  • A prediction function that takes a new feature vector \(x\) and predicts whether it is a “0”.
  • A decision rule defined by a threshold \(c\), written as a mathematical function.

2. Suppose we want to build an algorithm that predicts whether a digit image is a “4”. Let the image be summarized by a feature vector \(x\), and let the prediction function be \(\hat{f}(x)\), which returns a number between 0 and 1.

  • Write a decision rule that uses a cutoff \(c\) to convert the prediction \(\hat{f}(x)\) into a class label \(\hat{y}(x) \in \{0,1\}\).
  • Using the notation introduced in this chapter, write the set of all decision rules that can be created by varying the cutoff \(c\).
  • Briefly explain in one or two sentences how changing the cutoff affects false positives and false negatives.

3. Let \(\hat{p}(x)\) be a model that estimates the probability that an image is a “1”. Consider the following decision rule with cutoff \(c\):

\[ \hat{y}(x) = \begin{cases} 1 & \text{if } \hat{p}(x) > c \\ 0 & \text{otherwise.} \end{cases} \]

  • Write the 0–1 loss for a single observation \((x_i, y_i)\).
  • Write the average loss over a test set of size \(m\).
  • If \(c\) is increased, does the model predict “1” more often or less often? Explain briefly using the notation.