Foundation of Mathematical Biology / The Elements of Statistical Learning

Outline
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
by T. Hastie, R. Tibshirani, J. Friedman (2001). New York: Springer
www.springer-ny.com
Read Chapters 1-2
Linear Regression [Chapter 3]
Model Assessment and Selection [Chapter 7]
Additive Models and Trees [Chapter 9]
Classification [Chapters 4, 12]
Note: NOT data analysis class; see www.biostat.ucsf.edu/services.html
[Figure: PCR Calibration — scatterplot of Log Copy Number (y-axis, 0 to 8) versus Cycle Threshold (x-axis, 15 to 45)]
Example: Calibrating PCR
x: number of PCR cycles to achieve threshold
y: gene expression (log copy number)
Typical output from statistics package:
Residual SE = 0.3412, R-Square = 0.98
F-statistic = 343.7196 on 1 and 5 df

             coef   std.err    t.stat   p.val
  Intcpt  10.1315    0.3914   25.8858       0
  x       -0.2430    0.0131  -18.5397       0
Fitted regression line: y = 10.1315 − 0.2430x
All very nice and easy.
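Output like the table above can be reproduced with scipy's linregress (the `intercept_stderr` attribute needs scipy ≥ 1.6). The data below are hypothetical, simulated to resemble the reported fit (7 points, giving 1 and 5 df), since the actual calibration data are not shown:

```python
import numpy as np
from scipy import stats

# Hypothetical calibration data (7 points -> 1 and 5 df), simulated to
# mimic the fit reported above; the actual slide data are not given.
rng = np.random.default_rng(0)
x = np.array([15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])   # cycle threshold
y = 10.13 - 0.243 * x + rng.normal(0.0, 0.3, size=x.size)  # log copy number

fit = stats.linregress(x, y)
print(f"Intcpt: coef={fit.intercept:.4f}  se={fit.intercept_stderr:.4f}")
print(f"x     : coef={fit.slope:.4f}  se={fit.stderr:.4f}")
print(f"R-Square = {fit.rvalue ** 2:.2f}")
```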
Example: Calibrating PCR continued
In reality, things are usually more complicated:
Have calibrations corresponding to several genes and/or plates ⇒ how to synthesize?
Potentially have failed / flawed experiments ⇒ how to handle?
Often have measures on additional covariates (e.g., temperature) ⇒ how to accommodate?
Can have non-linear relationships (e.g., copy number itself), non-constant error variances (e.g., greater variability at higher cycles), and non-independent data (e.g., duplicates) ⇒ how to generalize?
[Figure: PCR Calibration — Log Copy Number versus Cycle Threshold (15 to 45), with separate calibration curves for six genes (1: IL14, 2: NKCC1, 3: stat-1b, 4: chymase, 5: IL9RA, 6: label not recovered); plotted digits indicate gene/plate membership of each point]
Simple Linear Regression
Data: {(x_i, y_i)}_{i=1}^N

x_i: explanatory / feature variable; covariate; input
y_i: response variable; outcome; output

A sample of N (covariate, outcome) pairs.
Linear Model:
y_i = β_0 + β_1 x_i + ε_i

Errors: ε_i – mean zero, constant variance σ²
Coefficients: β_0 – intercept; β_1 – slope
Estimation: least squares – select coefficients to minimize the Residual Sum of Squares (RSS)

RSS(β_0, β_1) = ∑_{i=1}^N (y_i − (β_0 + β_1 x_i))²
Solution, Assumptions: exercise
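For reference, the closed-form least-squares solution is β̂_1 = ∑(x_i − x̄)(y_i − ȳ) / ∑(x_i − x̄)² and β̂_0 = ȳ − β̂_1 x̄. A quick numpy check on arbitrary made-up data:

```python
import numpy as np

# Arbitrary made-up data, roughly linear
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for y_i = b0 + b1*x_i + e_i
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Agrees with numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
assert np.allclose([b0, b1], [b0_np, b1_np])
print(b0, b1)
```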
Multiple Linear Regression (Secn 3.2)
Data: {(x_i, y_i)}_{i=1}^N, where now each x_i is a covariate vector x_i = (x_{i1}, x_{i2}, …, x_{ip})^T
Linear Model:
y_i = β_0 + ∑_{j=1}^p β_j x_{ij} + ε_i

or, in matrix form,

y = Xβ + ε

where X is the N × (p + 1) matrix with rows (1, x_i^T);
y = (y_1, y_2, …, y_N)^T; ε = (ε_1, ε_2, …, ε_N)^T
Estimation of β = (β0; β1; : : : ; βp)T : minimize
RSS(β) = (y − Xβ)^T (y − Xβ)

Least Squares Solution: β̂ = (X^T X)^{−1} X^T y
Issues: inference, model selection, prediction, interpretation, assumptions/diagnostics, …
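A sketch of the least-squares solution in numpy, on made-up data. In practice, prefer np.linalg.lstsq (or a QR decomposition) over explicitly inverting X^T X, which is numerically less stable:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
# N x (p+1) design matrix; first column of ones gives the intercept
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -0.5, 0.25])
y = X @ beta_true + rng.normal(0.0, 0.1, size=N)

# Textbook formula: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```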
Inference
Sampling Variability: Assume the y_i are uncorrelated, with constant variance σ². The covariance matrix of β̂ is

Var(β̂) = (X^T X)^{−1} σ²
An unbiased estimate of σ² is

σ̂² = (1 / (N − p − 1)) ∑_{i=1}^N (y_i − ŷ_i)²

where ŷ = (ŷ_i) = Xβ̂ = X(X^T X)^{−1} X^T y = Hy.

H (the "hat" matrix) projects y onto ŷ in the column space of X.
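The projection interpretation is easy to verify numerically: H is symmetric, idempotent (H² = H), and has trace p + 1. A sketch on simulated data (assumed here, since the slides give none):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# Projection properties: idempotent, symmetric, trace = p + 1
assert np.allclose(H @ H, H)
assert np.allclose(H, H.T)
assert np.isclose(np.trace(H), p + 1)

# Unbiased variance estimate: sigma_hat^2 = RSS / (N - p - 1)
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)
```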
Tests, Intervals: Now assume the ε_i are independent, identically distributed N(0, σ²) (i.e., ε ∼ N(0, σ²I)). Then

β̂ ∼ N(β, (X^T X)^{−1} σ²)

(N − p − 1) σ̂² ∼ σ² χ²_{N−p−1}
To test whether the jth coefficient β_j = 0, use

z_j = β̂_j / (σ̂ √v_j)

where v_j is the jth diagonal element of (X^T X)^{−1}.

To simultaneously test sets of coefficients (e.g., related variables): p_1 + 1 terms in the larger subset; p_0 + 1 in the smaller. RSS_1 (RSS_0): residual sum of squares for the large (small) model. Use the F statistic:

F = [(RSS_0 − RSS_1) / (p_1 − p_0)] / [RSS_1 / (N − p_1 − 1)]

which (if the small model is correct) ∼ F_{p_1 − p_0, N − p_1 − 1}.
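A minimal sketch of this nested-model F test, comparing a small model (intercept + x1) against a larger one (intercept + x1 + x2) on simulated data where x2 genuinely matters (all data and effect sizes below are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N = 40
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=N)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X0 = np.column_stack([np.ones(N), x1])       # small model: p0 = 1
X1 = np.column_stack([np.ones(N), x1, x2])   # large model: p1 = 2
p0, p1 = 1, 2

# F = [(RSS0 - RSS1)/(p1 - p0)] / [RSS1/(N - p1 - 1)]
F = ((rss(X0, y) - rss(X1, y)) / (p1 - p0)) / (rss(X1, y) / (N - p1 - 1))
p_val = stats.f.sf(F, p1 - p0, N - p1 - 1)   # upper-tail probability
```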
A 1 − 2α confidence interval for β_j:

(β̂_j − z^{(1−α)} √v_j σ̂,  β̂_j + z^{(1−α)} √v_j σ̂)

where z^{(1−α)} is the 1 − α percentile of the normal distribution.
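A sketch computing z_j and this interval for each coefficient, on simulated data (the design, coefficients, and α below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, p = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))
v = np.diag(XtX_inv)                     # v_j: jth diagonal of (X^T X)^{-1}

z = beta_hat / (sigma_hat * np.sqrt(v))  # z_j for testing beta_j = 0

alpha = 0.025                            # 1 - 2*alpha = 95% interval
zq = stats.norm.ppf(1 - alpha)           # z^{(1-alpha)}
lower = beta_hat - zq * np.sqrt(v) * sigma_hat
upper = beta_hat + zq * np.sqrt(v) * sigma_hat
```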
Variable Selection (Secn 3.4)
Objectives: improve prediction and interpretation. Achieved by eliminating or reducing the role of lesser/redundant variables. Many different strategies/criteria.
Subset Selection: retain only a subset. Estimate coefs of retained variables with least squares as usual.
Best subset regression: for k = 1, …, p find those k variables giving the smallest RSS. Feasible for p up to roughly 30.
Forward stepwise selection: start with just the intercept; sequentially add variables (one at a time) that most improve the fit as measured by the F statistic above. Stop when no variable is significant at (say) the 10% level.
Backward stepwise elimination: start with full model; sequentially delete variables that contribute least.
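A minimal sketch of forward stepwise selection using the F statistic above — greedy, one variable at a time, stopping at a fixed significance level (10% here). The data and true model are assumed for illustration:

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_stepwise(X, y, alpha=0.10):
    """Greedy forward selection: at each step add the candidate column with
    the smallest F-test p-value; stop when the best addition is no longer
    significant at level alpha."""
    N, p = X.shape
    selected, remaining = [], list(range(p))
    current = np.ones((N, 1))               # start with intercept only
    while remaining:
        best = None
        for j in remaining:
            Xj = np.column_stack([current, X[:, j]])
            k = Xj.shape[1] - 1             # coefficients beyond the intercept
            # One added term, so p1 - p0 = 1
            F = (rss(current, y) - rss(Xj, y)) / (rss(Xj, y) / (N - k - 1))
            pval = stats.f.sf(F, 1, N - k - 1)
            if best is None or pval < best[1]:
                best = (j, pval)
        j, pval = best
        if pval > alpha:
            break
        selected.append(j)
        current = np.column_stack([current, X[:, j]])
        remaining.remove(j)
    return selected

rng = np.random.default_rng(5)
N = 100
X = rng.normal(size=(N, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=N)  # only columns 0, 3 matter
print(forward_stepwise(X, y))
```

Backward elimination is the mirror image: start from the full model and repeatedly drop the variable whose removal costs the least by the same F criterion.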