Foundation of Mathematical Biology / The Elements of Statistical Learning


Outline

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

by T. Hastie, R. Tibshirani, J. Friedman (2001). New York: Springer

www.springer-ny.com

Read Chapters 1-2

Linear Regression [Chapter 3]

Model Assessment and Selection [Chapter 7]

Additive Models and Trees [Chapter 9]

Classification [Chapters 4, 12]

Note: this is NOT a data analysis class; see www.biostat.ucsf.edu/services.html

PCR Calibration

[Figure: scatterplot of Log Copy Number (0–8) against Cycle Threshold (15–45).]

Example: Calibrating PCR

x: number of PCR cycles to achieve threshold
y: gene expression (log copy number)

Typical output from statistics package:

Residual SE = 0.3412, R-Square = 0.98
F-statistic = 343.7196 on 1 and 5 df

        coef      std.err   t.stat     p.val
Intcpt  10.1315   0.3914    25.8858    0
x       -0.2430   0.0131    -18.5397   0

Fitted regression line: y = 10.1315 − 0.2430 x

All very nice and easy.
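A minimal sketch of how output like this can be reproduced. The data below are made-up stand-ins for the calibration measurements (the lecture's actual values are not shown), and NumPy's polyfit is used for the least-squares line:

    import numpy as np

    # Hypothetical stand-in data (7 points, hence 1 and 5 df as in the output above);
    # the lecture's actual calibration values are not reproduced here.
    ct = np.array([18.0, 22.0, 26.0, 30.0, 34.0, 38.0, 42.0])   # cycle threshold (x)
    logcn = np.array([5.8, 4.7, 3.9, 2.8, 1.9, 0.9, 0.0])       # log copy number (y)

    slope, intercept = np.polyfit(ct, logcn, deg=1)             # least-squares line
    fitted = intercept + slope * ct
    r2 = 1 - np.sum((logcn - fitted) ** 2) / np.sum((logcn - logcn.mean()) ** 2)
    print(f"y = {intercept:.4f} + ({slope:.4f}) x,  R-Square = {r2:.2f}")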

Example: Calibrating PCR continued

In reality, things are usually more complicated:

Have calibrations corresponding to several genes and/or plates ⇒ how to synthesize?

Potentially have failed / flawed experiments ⇒ how to handle?

Often have measures on additional covariates (e.g., temperature) ⇒ how to accommodate?

Can have non-linear relationships (e.g., copy number itself), non-constant error variances (e.g., greater variability at higher cycles), and non-independent data (e.g., duplicates) ⇒ how to generalize?

PCR Calibration

[Figure: Log Copy Number (0–8) against Cycle Threshold (15–45), with calibration data for several genes plotted as numbered points; legend: 1 = IL14, 2 = NKCC1, 3 = stat-1b, 4 = chymase, 5 = IL9RA, 6 = …]

Simple Linear Regression

Data: {(xi, yi)}, i = 1, …, N

xi: explanatory / feature variable; covariate; input
yi: response variable; outcome; output

A sample of N (covariate, outcome) pairs.

Linear Model:

yi = β0 + β1xi + εi

Errors: ε – mean zero, constant variance σ²

Coefficients: β0 – intercept; β1 – slope

Estimation: least squares – select coefficients to

minimize the Residual Sum of Squares (RSS)

RSS(β0, β1) = Σ_{i=1}^{N} (yi − (β0 + β1 xi))²

Solution, Assumptions: exercise
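As a reference point for the exercise, here is a sketch of the standard closed-form least-squares solution (the function name simple_ols is just illustrative):

    import numpy as np

    def simple_ols(x, y):
        """Closed-form least-squares estimates for the model y = b0 + b1*x + e."""
        xbar, ybar = x.mean(), y.mean()
        b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
        b0 = ybar - b1 * xbar                                           # intercept
        return b0, b1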

Multiple Linear Regression (Secn 3.2)

Data: {(xi, yi)}, i = 1, …, N, where now each xi is a covariate vector xi = (xi1, xi2, …, xip)ᵀ

Linear Model:

yi = β0 + Σ_{j=1}^{p} βj xij + εi

y = Xβ + ε

where X is the N × (p + 1) matrix with rows (1, xiᵀ);
y = (y1, y2, …, yN)ᵀ; ε = (ε1, ε2, …, εN)ᵀ

Estimation of β = (β0, β1, …, βp)ᵀ: minimize

RSS(β) = (y − Xβ)ᵀ (y − Xβ)

Least Squares Solution: β̂ = (XᵀX)⁻¹ Xᵀ y

Issues: inference, model selection, prediction, interpretation, assumptions/diagnostics, …
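A minimal sketch of the least-squares solution in NumPy (ols_fit is a hypothetical helper name; the normal equations are solved directly rather than forming the inverse explicitly):

    import numpy as np

    def ols_fit(X_raw, y):
        """Least-squares fit betahat = (X^T X)^{-1} X^T y, intercept included.

        X_raw: N x p array of covariates (without the column of 1s)."""
        X = np.column_stack([np.ones(len(y)), X_raw])   # N x (p+1) design matrix
        # Solve the normal equations (X^T X) beta = X^T y; numerically this is
        # preferable to forming the explicit inverse.
        return np.linalg.solve(X.T @ X, X.T @ y)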

Inference

Sampling Variability: Assume the yi are uncorrelated with constant variance σ². The covariance matrix of β̂ is

Var(β̂) = (XᵀX)⁻¹ σ²

An unbiased estimate of σ² is

σ̂² = (1/(N − p − 1)) Σ_{i=1}^{N} (yi − ŷi)²

where ŷ = (ŷi) = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy.

H projects y onto the column space of X.
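A sketch of these quantities in NumPy, assuming a design matrix that already contains the intercept column (ols_inference is an illustrative name):

    import numpy as np

    def ols_inference(X, y):
        """sigma^2 estimate, Var(betahat), and the hat matrix for an OLS fit.

        X: N x (p+1) design matrix that already includes the intercept column."""
        N, q = X.shape                                   # q = p + 1
        XtX_inv = np.linalg.inv(X.T @ X)
        betahat = XtX_inv @ X.T @ y
        H = X @ XtX_inv @ X.T                            # hat matrix: yhat = H y
        yhat = H @ y
        sigma2_hat = np.sum((y - yhat) ** 2) / (N - q)   # unbiased estimate of sigma^2
        cov_betahat = XtX_inv * sigma2_hat               # Var(betahat) = (X^T X)^{-1} sigma^2
        return betahat, sigma2_hat, cov_betahat, H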

Tests, Intervals: Now assume the εi are independent, identically distributed N(0, σ²) (i.e., ε ~ N(0, σ²I)).

Then

β̂ ~ N(β, (XᵀX)⁻¹ σ²)

(N − p − 1) σ̂² ~ σ² χ²_{N−p−1}

To test if the jth coefficient βj = 0, use

zj = β̂j / (σ̂ √vj)

where vj is the jth diagonal element of (XᵀX)⁻¹.

To simultaneously test sets of coefficients (e.g. related variables): p1 + 1 terms in the larger subset; p0 + 1 in the smaller subset. RSS1 (RSS0): residual sum of squares for the large (small) model. Use the F statistic:

F = [(RSS0 − RSS1) / (p1 − p0)] / [RSS1 / (N − p1 − 1)]

which (if the smaller model is correct) has an F distribution with p1 − p0 and N − p1 − 1 degrees of freedom.
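A sketch of both test statistics under the assumptions above (coef_z_tests and f_test are hypothetical helper names; SciPy is used only to convert the F statistic to a p-value):

    import numpy as np
    from scipy import stats    # used only for the F-distribution tail probability

    def coef_z_tests(X, y):
        """z_j = betahat_j / (sigmahat * sqrt(v_j)) for each coefficient."""
        N, q = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        betahat = XtX_inv @ X.T @ y
        resid = y - X @ betahat
        sigmahat = np.sqrt(resid @ resid / (N - q))      # q = p + 1
        v = np.diag(XtX_inv)
        return betahat / (sigmahat * np.sqrt(v))

    def f_test(X_small, X_big, y):
        """F statistic comparing a smaller model nested inside a larger one.

        Both design matrices include the intercept column."""
        def rss(X):
            beta = np.linalg.lstsq(X, y, rcond=None)[0]
            r = y - X @ beta
            return r @ r
        N = len(y)
        p0, p1 = X_small.shape[1] - 1, X_big.shape[1] - 1
        F = ((rss(X_small) - rss(X_big)) / (p1 - p0)) / (rss(X_big) / (N - p1 - 1))
        pval = stats.f.sf(F, p1 - p0, N - p1 - 1)        # upper tail under the small model
        return F, pval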

 

A 1 − 2α confidence interval for βj:

(β̂j − z(1−α) √vj σ̂,  β̂j + z(1−α) √vj σ̂)

where z(1−α) is the 1 − α percentile of the normal distribution.
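A sketch of the corresponding interval computation (coef_confint is an illustrative name; it takes β̂, (XᵀX)⁻¹, and σ̂ from an earlier fit):

    import numpy as np
    from scipy import stats

    def coef_confint(betahat, XtX_inv, sigmahat, alpha=0.025):
        """(1 - 2*alpha) normal-theory confidence intervals for each coefficient."""
        z = stats.norm.ppf(1 - alpha)                    # 1 - alpha percentile of N(0, 1)
        se = sigmahat * np.sqrt(np.diag(XtX_inv))        # sqrt(v_j) * sigmahat
        return np.column_stack([betahat - z * se, betahat + z * se])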

Variable Selection (Secn 3.4)

Objectives: improve prediction and interpretation. Achieved by eliminating or reducing the role of lesser/redundant variables. There are many different strategies/criteria.

Subset Selection: retain only a subset. Estimate coefs of retained variables with least squares as usual.

Best subset regression: for k = 1, …, p find those k variables giving the smallest RSS. Feasible for p up to about 30.

Forward stepwise selection: start with just the intercept; sequentially add the variable (one at a time) that most improves the fit as measured by the F statistic (7). Stop when no variable is significant at (say) the 10% level (see the sketch after this list).

Backward stepwise elimination: start with full model; sequentially delete variables that contribute least.
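A sketch of forward stepwise selection along the lines described above; forward_stepwise and the f_cutoff parameter are illustrative (a real implementation would take the cutoff from the F distribution at the chosen significance level):

    import numpy as np

    def rss(X, y):
        """Residual sum of squares of an OLS fit of y on X (intercept column included in X)."""
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return r @ r

    def forward_stepwise(X_raw, y, f_cutoff=4.0):
        """Greedy forward selection: at each step add the variable whose 1-df F statistic
        for improving the current fit is largest; stop when no candidate exceeds f_cutoff."""
        N, p = X_raw.shape
        selected, remaining = [], list(range(p))
        design = np.ones((N, 1))                          # start with the intercept only
        current_rss = rss(design, y)
        while remaining:
            best = None                                   # (F, column index, candidate RSS)
            for j in remaining:
                candidate = np.column_stack([design, X_raw[:, j]])
                cand_rss = rss(candidate, y)
                df_resid = N - candidate.shape[1]         # N - p1 - 1
                F = (current_rss - cand_rss) / (cand_rss / df_resid)
                if best is None or F > best[0]:
                    best = (F, j, cand_rss)
            if best[0] < f_cutoff:
                break                                     # no remaining variable is "significant"
            design = np.column_stack([design, X_raw[:, best[1]]])
            current_rss = best[2]
            selected.append(best[1])
            remaining.remove(best[1])
        return selected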