High dimensional data

There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples.

A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. For example, we might want to know how much two handwritten digits look like each other. However, note that each observations is associated with \(28 \times 28 = 784\) pixels so we can’t simply use subtraction as we would do if our data was one dimensional. Instead, we will define observations as points in a high-dimensional space and mathematically define a distance. Many machine learning techniques, discussed in the next part of the book, require this calculation.

Additionally, this part of the book discusses dimension reduction. Here we search of data summaries that result in more manageable lower dimension versions of the data, but preserve most or all the information we need. Here too we can use distance between observations as specific challenge: we will reduce the dimensions summarize the data into lower dimensions, but in a way that preserves the distance between any two observations. We use linear algebra as a mathematical foundation for all the techniques presented here.