26  Notation and Terminology

In Section 21.2, we introduced the MNIST handwritten digits dataset. We now return to this dataset and use the task of recognizing handwritten digits as our running example for the next few chapters. Our goal in this section is not yet to develop a full algorithm, but to introduce the machine learning notation and terminology that we will rely on throughout this part of the book.

Handwritten digit recognition is a classic machine learning problem. Decades ago, mail sorting in the post office required humans to read every zip code written on envelopes. Today, machines perform this task automatically: a computer reads the digitized image of the zip code, and a robot sorts the letter accordingly.

In the pages that follow, we will describe how to formalize this problem mathematically, define the objects we need to work with, and introduce the language used to describe inputs, outputs, features, predictions, and learning algorithms. This will set the foundation for the machine learning methods developed in later chapters.

26.1 Terminology

In machine learning, data comes in the form of the outcome we want to predict and the features that we will use to predict it. We build algorithms that take feature values as input and return a prediction for the outcome. The machine learning approach is to train an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don’t know the outcome.

Prediction problems can be divided into categorical and continuous outcomes. Categorical outcomes can be any one of \(K\) classes. The number of classes can vary greatly across applications. For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In speech recognition, the outcomes are all possible words or phrases we are trying to detect. Spam detection has two outcomes: spam or not spam. In this book, we denote the \(K\) categories with indexes \(k=1,\dots,K\). However, for binary data, we will use \(k=0,1\) for mathematical convenience, which we demonstrate later.

26.2 Notation

We will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote the features. These features are also sometimes referred to as predictors or covariates, and we will treat these terms as synonyms.

The first step in building a machine learning algorithm is to clearly identify what the outcomes and features are. In Section 21.2, we showed that each digitized image \(i\) is associated with a categorical outcome \(Y_i\) and a set of features \(X_{i1}, \dots, X_{ip}\), with \(p=784\). For convenience, we often use boldface notation \(\mathbf{X}_i = (X_{i1}, \dots, X_{ip})^\top\) to represent the vector of predictors, following the notation introduced in Section 21.1.

When referring to an arbitrary set of features rather than a specific image, we drop the index \(i\) and simply use \(Y\) and \(\mathbf{X} = (X_1, \dots, X_p)\). We use uppercase to emphasize that we think of these as random variables; observed values are denoted in lowercase, such as \(Y=y\) and \(\mathbf{X} = \mathbf{x}\). In practice, when writing code, we typically use lowercase.
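To illustrate the lowercase convention in code, here is a minimal Python sketch using NumPy. The dataset is made up purely for illustration: a small stand-in for the digit data with \(n=5\) observations and \(p=4\) features rather than the full \(p=784\).

```python
import numpy as np

# A toy dataset standing in for the digit data: n = 5 observations,
# p = 4 features. Following the convention above, observed values
# get lowercase names in code: x for the features, y for the outcomes.
x = np.array([[0.0, 0.2, 0.9, 0.1],
              [0.8, 0.7, 0.0, 0.3],
              [0.1, 0.9, 0.4, 0.0],
              [0.5, 0.5, 0.5, 0.5],
              [0.9, 0.0, 0.2, 0.8]])  # feature matrix: one row per observation
y = np.array([3, 7, 3, 1, 7])         # categorical outcome for each observation

n, p = x.shape  # number of observations and number of features
print(n, p)     # -> 5 4
```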

To denote predictions, we use hats, just as we do for parameter estimates. For a specific observation \(i\) with predictors \(\mathbf{x}_i\), the prediction is written as \(\hat{y}_i\). For an arbitrary predictor vector \(\mathbf{x}\), we write the prediction as \(\hat{y}(\mathbf{x})\), emphasizing that the prediction is a function of the predictors.

26.3 The machine learning challenge

The machine learning task is to build an algorithm that predicts the outcome given any combination of features. At first glance, this might seem impossible, but we will start with very simple examples and gradually build toward more complex cases. We begin with one predictor, then extend to two predictors, and eventually tackle real-world challenges involving thousands of predictors.

The general setup is as follows. We have a series of features and an unknown outcome we want to predict:

| outcome | feature 1 | feature 2 | feature 3 | \(\dots\) | feature \(p\) |
|---------|-----------|-----------|-----------|-----------|---------------|
| ?       | \(X_1\)   | \(X_2\)   | \(X_3\)   | \(\dots\) | \(X_p\)       |

To build a model that predicts outcomes from observed features \(X_1=x_1, X_2=x_2, \dots, X_p=x_p\), we need a dataset where the outcomes are known:

| outcome    | feature 1   | feature 2   | feature 3   | \(\dots\)  | feature \(p\) |
|------------|-------------|-------------|-------------|------------|---------------|
| \(y_1\)    | \(x_{1,1}\) | \(x_{1,2}\) | \(x_{1,3}\) | \(\dots\)  | \(x_{1,p}\)   |
| \(y_2\)    | \(x_{2,1}\) | \(x_{2,2}\) | \(x_{2,3}\) | \(\dots\)  | \(x_{2,p}\)   |
| \(\vdots\) | \(\vdots\)  | \(\vdots\)  | \(\vdots\)  | \(\ddots\) | \(\vdots\)    |
| \(y_n\)    | \(x_{n,1}\) | \(x_{n,2}\) | \(x_{n,3}\) | \(\dots\)  | \(x_{n,p}\)   |

When the outcome is continuous, we refer to the task as prediction. The model returns a function \(f\) that produces a prediction \(\hat{y}(\mathbf{x}) = f(x_1, x_2, \dots, x_p)\) for any feature vector. We call \(y\) the actual outcome. Predictions \(\hat{y}(\mathbf{x})\) are rarely exact, so we measure accuracy by the error, defined as \(y - \hat{y}(\mathbf{x})\).
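The error for each observation is simply the actual outcome minus the prediction. A minimal sketch, with made-up outcomes and predictions standing in for the output of some fitted model:

```python
import numpy as np

# Hypothetical actual outcomes y and predictions y_hat for five observations.
y = np.array([3.0, -1.0, 2.5, 0.0, 4.2])
y_hat = np.array([2.8, -0.5, 2.5, 0.4, 4.0])

# The error is defined as actual minus predicted, one value per observation.
errors = y - y_hat
print(errors)
```

A positive error means the model under-predicted; a negative error means it over-predicted.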

When the outcome is categorical, the task is called classification. The model produces a decision rule that prescribes which of the \(K\) possible classes should be predicted.

Typically, a classification model outputs one score per class, denoted \(f_k(x_1, \dots, x_p)\) for class \(k\). To make a prediction at a predictor vector \(\mathbf{x} = (x_1,\dots,x_p)\), we select the class with the largest score:

\[ \hat{y}(\mathbf{x}) = \arg\max_{k \in \{1,\dots,K\}} f_k(\mathbf{x}). \]
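This rule is straightforward to express in code. In the sketch below, the scores are hypothetical values of \(f_k(\mathbf{x})\) for \(K=3\) classes at a single predictor vector:

```python
import numpy as np

# Hypothetical scores f_k(x) for K = 3 classes at one predictor vector x.
scores = np.array([0.1, 0.7, 0.2])

# Predict the class with the largest score. argmax returns a zero-based
# index, so we add 1 to match the labels k = 1, ..., K used in the text.
y_hat = int(np.argmax(scores)) + 1
print(y_hat)  # -> 2
```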

In the binary case (\(K=2\)), this is often written using a cutoff: if \(f_1(\mathbf{x}) > c\), we predict class 1; otherwise, we predict class 2. Here, predictions are simply correct or incorrect when compared to the true class.
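The binary cutoff rule can be sketched as a one-line function. The cutoff value \(c = 0.5\) used as the default here is a common choice when \(f_1\) is a probability, but it is only an illustrative assumption, not a universal rule:

```python
def predict_binary(f1_x, c=0.5):
    """Predict class 1 when the class-1 score f1_x exceeds the cutoff c,
    and class 2 otherwise."""
    return 1 if f1_x > c else 2

print(predict_binary(0.8))  # score above the cutoff -> class 1
print(predict_binary(0.3))  # score below the cutoff -> class 2
```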

It is worth noting that terminology varies across textbooks and courses. Sometimes, prediction is used for both categorical and continuous outcomes. The term regression is also used for continuous outcomes, but here we avoid it to prevent confusion with linear regression. In most contexts, whether outcomes are categorical or continuous will be clear, so we will simply use prediction or classification as appropriate.

26.4 Statistical vs. Machine Learning language

Up to this point in the book, we have used terminology that comes from statistics: models, estimators, and parameters. As we transition to machine learning, we encounter a new vocabulary. Some of these new terms overlap with familiar statistical ideas, while others reflect differences in emphasis between the two fields. In this section, we clarify the terminology we will use throughout the machine learning chapters.

In statistics, a model refers to a mathematical description of how the data are generated, for example, a linear regression model with coefficients that represent the relationship between predictors and the outcome. In machine learning, the word model is used similarly, but with a slightly more practical emphasis: a model is the fitted object that we use to make predictions. For instance, when we fit a logistic regression to predict a binary outcome, the resulting estimated regression function is what a machine learner would call the model.

A key difference arises with the word algorithm. In statistics, an algorithm is simply a computational routine used to compute an estimate. In machine learning, however, algorithm is often used more broadly. It can refer to the computational procedure used to fit a model, for example, the iterative steps used to estimate the coefficients in logistic regression, but it also commonly refers to the overall prediction strategy itself. Thus, in machine learning practice, people talk about “the logistic regression algorithm” even though in statistics, logistic regression is thought of as a model rather than an algorithm. This dual use of the term is standard in the field, and we follow that convention here.

Finally, machine learning literature often uses method, approach, or procedure as general terms for any systematic way of mapping predictors to predictions. In earlier chapters, we introduced regression and logistic regression as statistical models. From the machine learning point of view, these are also prediction methods, functions we learn from data that can be applied to new cases.

Another important pair of terms is supervised versus unsupervised learning. In supervised learning, the goal is as described above: to predict an outcome we can observe in the training data. Regression and logistic regression fall into this category, and most of the machine learning techniques we discuss in the next chapters will also be supervised methods. In unsupervised learning, by contrast, there is no outcome variable; the goal is to uncover structure in the predictors themselves. Clustering is the most common example of this. We will study supervised learning first because it connects most directly to prediction, our primary goal, before briefly describing unsupervised learning in the last chapter.

26.5 Exercises

  1. In a machine learning context, we observe features \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\) and outcomes \(y_1, y_2, \ldots, y_n\). What do we call the function \(f\) that we try to estimate, and what is our goal in estimating it?

  2. Suppose we have a dataset with \(n = 1,000\) observations and \(p = 50\) features. Using standard notation, write out:

    1. The feature matrix \(\mathbf{X}\)
    2. The outcome vector \(\mathbf{y}\)
    3. A single prediction \(\hat{y}_i\) for the \(i\)-th observation
  3. In binary classification, we often use a cutoff \(c\) to convert probabilities into class predictions. If \(\hat{p}(\mathbf{x})\) represents our estimated probability that an observation belongs to class 1, write the decision rule that predicts class 1 when \(\hat{p}(\mathbf{x}) > c\) and class 0 otherwise. What do we call the value \(c\)?