High-Dimensional Data
High-dimensional datasets are increasingly common in modern data analysis, especially in fields like genomics, image processing, natural language processing, and recommender systems. A variety of computational techniques and statistical concepts are useful for analyzing datasets in which each observation is associated with a large number of numerical variables. In this part of the book, we introduce several of these ideas, providing brief introductions to linear algebra, dimension reduction, matrix factorization, and regularization. As motivating examples, we use handwritten digit recognition and movie recommendation systems, both of which involve high-dimensional datasets with hundreds or thousands of variables per observation. We start by demonstrating how to work with matrices in R.
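As a taste of what is to come, here is a minimal sketch of basic matrix operations in base R; the object names and values are illustrative.

```r
# A small preview of matrix operations in R (object names are illustrative).
# Build a 3 x 4 matrix from a vector; R fills it column by column by default.
x <- matrix(1:12, nrow = 3, ncol = 4)
dim(x)        # dimensions: 3 4
x[2, 3]       # a single entry: row 2, column 3
x[, 1]        # the first column as a vector
t(x)          # the transpose: a 4 x 3 matrix
x %*% t(x)    # matrix multiplication: a 3 x 3 matrix
```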
One specific task we use to motivate linear algebra is measuring the similarity between two handwritten digits. Because each digit is represented by \(28 \times 28 = 784\) pixel values, we cannot summarize how different two digits are with a single subtraction, as we can with one-dimensional observations. Instead, we treat each observation as a point in a 784-dimensional space and use a mathematical definition of distance, such as the Euclidean distance, to quantify similarity. Many machine learning techniques introduced later in the book rely on this geometric interpretation.
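As an illustration, assuming simulated pixel intensities in place of actual digit images, the Euclidean distance between two such points can be computed directly from its definition or with the built-in dist function:

```r
# Euclidean distance between two observations in a 784-dimensional space.
# Simulated pixel intensities stand in for two actual digit images.
set.seed(1)
x1 <- runif(784)  # "pixels" of the first digit
x2 <- runif(784)  # "pixels" of the second digit

# Distance from the definition: square root of the sum of squared differences.
sqrt(sum((x1 - x2)^2))

# The same computation with the built-in dist() function.
dist(rbind(x1, x2))
```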
We also use this high-dimensional concept of distance to motivate dimension reduction, a set of techniques that summarize high-dimensional data in lower-dimensional representations that are easier to visualize and analyze, while preserving the essential information. Distance between observations provides a concrete goal: we aim to reduce the number of variables while preserving the pairwise distances between observations as much as possible. This goal leads naturally to matrix factorization, a class of methods that approximate a data matrix with a product of lower-dimensional matrices.
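To preview the idea, here is a sketch, using simulated data rather than the digits, of how principal component analysis (via the base R function prcomp) can reduce ten columns to two while approximately preserving pairwise distances; the dimensions and noise level are illustrative.

```r
# Sketch: dimension reduction that approximately preserves pairwise distances.
# Simulated data: 100 observations that live near a 2-dimensional subspace
# but are recorded with 10 columns.
set.seed(1)
z <- matrix(rnorm(100 * 2), ncol = 2)              # two underlying dimensions
x <- z %*% matrix(rnorm(2 * 10), ncol = 10) +      # embedded in 10 dimensions
     matrix(rnorm(100 * 10, sd = 0.1), ncol = 10)  # plus a little noise

pca <- prcomp(x)
d_full    <- dist(x)             # pairwise distances using all 10 columns
d_reduced <- dist(pca$x[, 1:2])  # pairwise distances using only 2 components
cor(as.numeric(d_full), as.numeric(d_reduced))  # close to 1: distances preserved
```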
Finally, we introduce the concept of regularization, which is useful when analyzing high-dimensional data. In many applications, the large number of variables increases the risk of overfitting or cherry-picking results that appear significant by chance. Regularization provides a mathematically principled way to constrain models, improve generalization, and avoid misleading conclusions.
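As a preview, the following is a minimal sketch of ridge regression, one common form of regularization, using only base R and simulated data; the penalty value \(\lambda = 5\), the dimensions, and the object names are all illustrative. The penalty term shrinks coefficient estimates toward zero, which reduces the chance of overfitting when many predictors are irrelevant.

```r
# Sketch of regularization via ridge regression (illustrative, base R only).
# With many columns relative to observations, ordinary least squares can
# overfit; adding a penalty lambda shrinks the estimates toward zero.
set.seed(1)
n <- 50; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- c(2, -2, rep(0, p - 2))  # only 2 of the 20 predictors matter
y <- x %*% beta + rnorm(n)

lambda <- 5  # penalty: larger values mean more shrinkage (illustrative choice)
# Closed-form ridge estimate: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(t(x) %*% x + lambda * diag(p), t(x) %*% y)
round(drop(beta_ridge), 2)  # the noise coefficients are shrunk toward zero
```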
Together, these topics lay the groundwork for understanding and implementing many of the machine learning techniques we cover in the next part of the book.