Statistical Learning: Algorithmic and Nonparametric Approaches

On this web page you will find:

The class outline. For each section, you can obtain the class notes in pdf and the R code used to generate the analyses and graphs.
Links for homework: the data needed, assignment sheets in PDF, and the LaTeX files.
Books often referenced
Computing resources.
Class general information.

Lecture	Title	Description	Notes	Code
NA	Review	Stuff you should know: Basics of probability, the central limit theorem, and inference	PDF	NA
1	Introduction to Regression and Prediction	We will describe linear regression in the context of a prediction problem.	PDF	R
2	Overview of Supervised Learning	Regression for predicting bivariate data, K nearest neighbors (KNN), bin smoothers, and an introduction to the bias/variance trade-off.	PDF	R
3-4	Linear Methods for Regression	Subset selection and ridge regression. We will use singular value decomposition (SVD) and principal component analysis (PCA) to understand these methods.	PDF	R
5	Linear Methods for Classification	Linear Regression, Linear Discriminant Analysis (LDA), and Logistic Regression	PDF	R
6	Kernel Methods	Kernel smoothers, including loess. We will briefly describe two-dimensional smoothers. We will also define degrees of freedom in the context of smoothing and learn about density estimators.	PDF	R
7	Model Assessment and Selection	We revisit the bias-variance trade-off. We describe how Monte Carlo simulations can be used to assess bias and variance. We then introduce cross-validation, AIC, and BIC.	PDF	R
8	The Bootstrap	We give a short introduction to the bootstrap and demonstrate its utility in smoothing problems.	PDF	R
9-10	Splines, Wavelets, and Friends	We give intuitive and mathematical descriptions of splines and wavelets. We use the SVD to understand these better and see connections with signal processing methods.	PDF	R
11-12	Additive Models, GAM and Neural Networks	We move back to cases with many covariates. We introduce projection pursuit, additive models, and generalized additive models. We briefly describe neural networks and explain the connection to projection pursuit.	PDF	NA
13-14	CART, Boosting and Additive Trees	We introduce classification and regression trees (CART) as well as more modern versions such as random forests.	PDF	archive for CART, archive for others
15	Model Averaging	Bayesian Statistics, Boosting and Bagging	PDF	NA
16	Clustering algorithms	Notes and code taken from my microarray class	PDF	R

Homework:

Homework 1 [Due 4/10]: Look through the top journal in your field for a paper in which a regression analysis was performed, many covariates were available, and p-values were reported.
If your field is mathematical (statistics, biostatistics, engineering, etc.), then look through the top journal of your favorite public health application. If you don't have one, then use the American Journal of Epidemiology (there should be plenty of regression analyses in this journal).
- Discuss how the model was motivated. Deductively, empirically, both or neither?
- Give me your thoughts on their model choice. Could they have done something differently? Are the results described model-driven?
- Where does the p in p-value come from? That is, where does the randomness come from: random sampling, randomization, or nature? If nature, then write a paragraph explaining how.
Homework 2 [Due 4/17]
- Use this training data to predict the outcomes for this data. You should give the 500 predictions and an estimate of the number of mistakes you've made. Please send a text file with only the predictions (separated by spaces). Include a description of what you did. Whoever predicts best wins first prize. Whoever best estimates the number of mistakes they make comes in second. Prizes will be handed out.
- Derive the discrimination function for LDA (third equation on page 75) and show it is linear.
- Show that LDA and regression are equivalent when the outcomes are binary.
Homework 3 [Due 4/24]
- Download the Strontium Data [text file] and fit polynomials of degree 1, 2, 3, 4, 6, and 12, a spline (you pick the knots), and smoothing splines. Make plots of the data and the fitted curves.
- Write a paragraph describing your project.
Homework 4 [Due 5/1]
- From this data, get your best estimate of y (yhat) and confidence bands for each of the given x-values. First prize goes to the smallest RSS; second prize goes to the confidence bands that contain the true f(x) entirely and have the smallest area between them.
- Turn in project first draft.
Project [Last day of class]

Data-sets:

All data except vowel training [Zip file]
Prostate Cancer Data [Description, R image, CSV file]
Vowel Training Data [Description, Train, Test, R Image]
Strontium data [text file]
CD4 data [text file]
All Mouse data [text file]
Mouse Body Temperature data [text file]
Diabetes data [text file]
Kyphosis data [text file]
Microarray data [text file]
Cholestyramine data [text file]
Intensity data [text file]
gam.datasets data [Splus file]
Pollution data sets [data in CSV, variable descriptions]

Recommended Books

T. Hastie, R. Tibshirani, and J. H. Friedman. (2001) The Elements of Statistical Learning. Springer-Verlag: New York. [Web Page]
Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics with S. Springer-Verlag: New York.
Brian D. Ripley. (1996) Pattern Recognition and Neural Networks. Cambridge University Press.

Resources

Class General Info

Course title: Statistical Learning: Algorithmic and Nonparametric Approaches (140.649)

Lab Hour:

Instructor: Rafael Irizarry
Department of Biostatistics
Phone 410-614-5157, email: rafa@jhu.edu
I assume you know: Linear algebra and statistical principles at the 651-654 level.
It will be useful to learn one of the following programming languages: R (recommended), S-Plus, or MATLAB.
Grading: 3 homeworks 60%, 2 quizzes 20%, 1 project 20%
Course description: Teaches public health students to use modern, computationally based methods for exploring and drawing inferences from data. After a brief review of probability, the central limit theorem, and inference, the course covers resampling methods, nonparametric regression, prediction, dimension reduction, and clustering. Specifically covers: Monte Carlo simulation, the bootstrap, cross-validation, splines, locally weighted regression, CART, random forests, neural networks, support vector machines, and hierarchical clustering.

Last updated: 4/18/2006