25  Notation and terminology

In \(\ref{@mnist}\), we introduced the MNIST handwritten digits dataset. Here we describe how the task of automatically reading these digits can be framed as a machine learning challenge. In doing so, we introduce machine learning mathematical notation and terminology used throughout this part of the book.

Originally, mail sorting in the post office involved humans reading zip codes written on the envelopes. Today, thanks to machine learning algorithms, a computer can read zip codes and then a robot sorts the letters. We will learn how to build algorithms that can read a digitized handwritten digit.

25.1 Terminology

In machine learning, data comes in the form of the outcome we want to predict and the features that we will use to predict the outcome. We build algorithms that take feature values as input and returns a prediction for the outcome when we don’t know the outcome. The machine learning approach is to train an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don’t know the outcome.

Prediction problems can be divided into categorical and continuous outcomes. For categorical outcomes, \(Y\) can be any one of \(K\) classes. The number of classes can vary greatly across applications. For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In speech recognition, the outcomes are all possible words or phrases we are trying to detect. Spam detection has two outcomes: spam or not spam. In this book, we denote the \(K\) categories with indexes \(k=1,\dots,K\). However, for binary data we will use \(k=0,1\) for mathematical conveniences that we demonstrate later.

25.2 Notation

Here we will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote features. Note that features are sometimes referred to as predictors or covariates. We consider all these to be synonyms.

The first step in building an algorithm is to understand what are the outcomes and features. In Section 20.1, we showed that associated with each digitized image \(i\), there is a categorical outcome \(Y_i\) and features \(X_{i,1}, \dots, X_{i,p}\), with \(p=784\). We use bold face \(\mathbf{X}_i = (X_{i,1}, \dots, X_{i,p})\) to denote the vector of predictors. Notice that we are using the matrix notation described in Section 20.5. When referring to an arbitrary set of features rather than a specific image, we drop the index \(i\) and use \(Y\) and \(\mathbf{X} = (X_{1}, \dots, X_{p})\). We use upper case variables because, in general, we think of the outcome and predictors as random variables. We use lower case, for example \(\mathbf{X} = \mathbf{x}\), to denote observed values. Although, when we code, we adhere to lower case.

The machine learning task is to build an algorithm that returns a prediction for any of the possible values of the features. Here, we will learn several approaches to building these algorithms. Although at this point it might seem impossible to achieve this, we will start with basic examples and build up our knowledge until we can tackle more complex ones. In fact, we start with an artificially simple example with just one predictor and then move on to a slightly more realistic example with two predictors. Once we understand these, we will address real-world machine learning challenges involving many predictors.

25.3 The machine learning challenge

The general setup is as follows. We have a series of features and an unknown outcome we want to predict:

outcome feature 1 feature 2 feature 3 \(\dots\) feature p
? \(X_1\) \(X_2\) \(X_3\) \(\dots\) \(X_p\)

To build a model that provides a prediction for any set of observed values \(X_1=x_1, X_2=x_2, \dots X_p=x_p\), we collect data for which we know the outcome:

outcome feature 1 feature 2 feature 3 \(\dots\) feature 5
\(y_{1}\) \(x_{1,1}\) \(x_{1,2}\) \(x_{1,3}\) \(\dots\) \(x_{1,p}\)
\(y_{2}\) \(x_{2,1}\) \(x_{2,2}\) \(x_{2,3}\) \(\dots\) \(x_{2,p}\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\)
\(y_n\) \(x_{n,1}\) \(x_{n,2}\) \(x_{n,3}\) \(\dots\) \(x_{n,p}\)

When the output is continuous, we refer to the machine learning task as prediction, and the main output of the model is a function \(f\) that automatically produces a prediction, denoted with \(\hat{y}\), for any set of predictors: \(\hat{y} = f(x_1, x_2, \dots, x_p)\). We use the term actual outcome to denote what we end up observing. So we want the prediction \(\hat{y}\) to match the actual outcome \(y\) as best as possible. Because our outcome is continuous, our predictions \(\hat{y}\) will not be either exactly right or wrong, but instead we will determine an error defined as the difference between the prediction and the actual outcome \(y - \hat{y}\).

When the outcome is categorical, we refer to the machine learning task as classification, and the main output of the model will be a decision rule which prescribes which of the \(K\) classes we should predict. In this scenario, most models provide functions of the predictors for each class \(k\), \(f_k(x_1, x_2, \dots, x_p)\), that are used to make this decision. When the data is binary, a typical decision rules looks like this: if \(f_1(x_1, x_2, \dots, x_p) > C\), predict category 1, if not the other category, with \(C\) a predetermined cutoff. Because the outcomes are categorical, our predictions will be either right or wrong.

Notice that these terms vary among courses, textbooks, and other publications. Often prediction is used for both categorical and continuous outcomes, and the term regression can be used for the continuous case. Here we avoid using regression to avoid confusion with our previous use of the term linear regression. In most cases, it will be clear if our outcomes are categorical or continuous, so we will avoid using these terms when possible.