Foundation of Mathematical Biology / The Elements of Statistical Learning

Outline
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
by T. Hastie, R. Tibshirani, J. Friedman (2001). New York: Springer
www.springer-ny.com
Read Chapters 1-2
Linear Regression [Chapter 3]
Model Assessment and Selection [Chapter 7]
Additive Models and Trees [Chapter 9]
Classification [Chapters 4, 12]
Note: NOT data analysis class; see www.biostat.ucsf.edu/services.html
[Figure: PCR Calibration — scatterplot of Log Copy Number (y-axis, 0 to 8) versus Cycle Threshold (x-axis, 15 to 45)]
Example: Calibrating PCR
x: number of PCR cycles to achieve threshold
y: gene expression (log copy number)
Typical output from statistics package:
Residual SE = 0.3412, R-Square = 0.98
F-statistic = 343.7196 on 1 and 5 df

             coef   std.err    t.stat   p.val
  Intcpt  10.1315    0.3914   25.8858       0
  x       -0.2430    0.0131  -18.5397       0
Fitted regression line: y = 10.1315 − 0.2430x
All very nice and easy.
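Output like the table above can be reproduced with scipy's linregress (the `intercept_stderr` attribute needs scipy ≥ 1.6). The data below are hypothetical, simulated to resemble the reported fit (7 points, giving 1 and 5 df), since the actual calibration data are not shown:

```python
import numpy as np
from scipy import stats

# Hypothetical calibration data (7 points -> 1 and 5 df), simulated to
# mimic the fit reported above; the actual slide data are not given.
rng = np.random.default_rng(0)
x = np.array([15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])   # cycle threshold
y = 10.13 - 0.243 * x + rng.normal(0.0, 0.3, size=x.size)  # log copy number

fit = stats.linregress(x, y)
print(f"Intcpt: coef={fit.intercept:.4f}  se={fit.intercept_stderr:.4f}")
print(f"x     : coef={fit.slope:.4f}  se={fit.stderr:.4f}")
print(f"R-Square = {fit.rvalue ** 2:.2f}")
```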
Example: Calibrating PCR continued
In reality, things are usually more complicated:
Have calibrations corresponding to several genes and/or plates ⇒ how to synthesize?
Potentially have failed / flawed experiments ⇒ how to handle?
Often have measures on additional covariates (e.g., temperature) ⇒ how to accommodate?
Can have non-linear relationships (e.g., copy number itself), non-constant error variances (e.g., greater variability at higher cycles), and non-independent data (e.g., duplicates) ⇒ how to generalize?
[Figure: PCR Calibration — Log Copy Number versus Cycle Threshold (15 to 45), with separate calibration curves for six genes (1: IL14, 2: NKCC1, 3: stat-1b, 4: chymase, 5: IL9RA, 6: label not recovered); plotted digits indicate gene/plate membership of each point]
Simple Linear Regression
Data: {(x_i, y_i)}_{i=1}^N

x_i: explanatory / feature variable; covariate; input
y_i: response variable; outcome; output

A sample of N (covariate, outcome) pairs.
Linear Model:
y_i = β_0 + β_1 x_i + ε_i

Errors: ε_i – mean zero, constant variance σ²
Coefficients: β_0 – intercept; β_1 – slope
Estimation: least squares – select coefficients to minimize the Residual Sum of Squares (RSS)

RSS(β_0, β_1) = ∑_{i=1}^N (y_i − (β_0 + β_1 x_i))²
Solution, Assumptions: exercise
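For reference, the closed-form least-squares solution is β̂_1 = ∑(x_i − x̄)(y_i − ȳ) / ∑(x_i − x̄)² and β̂_0 = ȳ − β̂_1 x̄. A quick numpy check on arbitrary made-up data:

```python
import numpy as np

# Arbitrary made-up data, roughly linear
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for y_i = b0 + b1*x_i + e_i
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Agrees with numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
assert np.allclose([b0, b1], [b0_np, b1_np])
print(b0, b1)
```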
Multiple Linear Regression (Secn 3.2)
Data: {(x_i, y_i)}_{i=1}^N, where now each x_i is a covariate vector x_i = (x_{i1}, x_{i2}, …, x_{ip})^T
Linear Model:
y_i = β_0 + ∑_{j=1}^p β_j x_{ij} + ε_i

or, in matrix form,

y = Xβ + ε

where X is the N × (p + 1) matrix with rows (1, x_i^T);
y = (y_1, y_2, …, y_N)^T; ε = (ε_1, ε_2, …, ε_N)^T
Estimation of β = (β0; β1; : : : ; βp)T : minimize
RSS(β) = (y − Xβ)^T (y − Xβ)

Least Squares Solution: β̂ = (X^T X)^{−1} X^T y
Issues: inference, model selection, prediction, interpretation, assumptions/diagnostics, …
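A sketch of the least-squares solution in numpy, on made-up data. In practice, prefer np.linalg.lstsq (or a QR decomposition) over explicitly inverting X^T X, which is numerically less stable:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
# N x (p+1) design matrix; first column of ones gives the intercept
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -0.5, 0.25])
y = X @ beta_true + rng.normal(0.0, 0.1, size=N)

# Textbook formula: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```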
Inference
Sampling Variability: Assume the y_i are uncorrelated, with constant variance σ². The covariance matrix of β̂ is

Var(β̂) = (X^T X)^{−1} σ²
An unbiased estimate of σ² is

σ̂² = (1 / (N − p − 1)) ∑_{i=1}^N (y_i − ŷ_i)²

where ŷ = (ŷ_i) = Xβ̂ = X(X^T X)^{−1} X^T y = Hy.

H (the "hat" matrix) projects y onto ŷ in the column space of X.
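The projection interpretation is easy to verify numerically: H is symmetric, idempotent (H² = H), and has trace p + 1. A sketch on simulated data (assumed here, since the slides give none):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# Projection properties: idempotent, symmetric, trace = p + 1
assert np.allclose(H @ H, H)
assert np.allclose(H, H.T)
assert np.isclose(np.trace(H), p + 1)

# Unbiased variance estimate: sigma_hat^2 = RSS / (N - p - 1)
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)
```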
Tests, Intervals: Now assume the ε_i are independent, identically distributed N(0, σ²) (i.e., ε ∼ N(0, σ²I)). Then

β̂ ∼ N(β, (X^T X)^{−1} σ²)

(N − p − 1) σ̂² ∼ σ² χ²_{N−p−1}
To test whether the jth coefficient β_j = 0, use

z_j = β̂_j / (σ̂ √v_j)

where v_j is the jth diagonal element of (X^T X)^{−1}.

To simultaneously test sets of coefficients (e.g., related variables): p_1 + 1 terms in the larger subset; p_0 + 1 in the smaller. RSS_1 (RSS_0): residual sum of squares for the large (small) model. Use the F statistic:

F = [(RSS_0 − RSS_1) / (p_1 − p_0)] / [RSS_1 / (N − p_1 − 1)]

which (if the small model is correct) ∼ F_{p_1 − p_0, N − p_1 − 1}.
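A minimal sketch of this nested-model F test, comparing a small model (intercept + x1) against a larger one (intercept + x1 + x2) on simulated data where x2 genuinely matters (all data and effect sizes below are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N = 40
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=N)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X0 = np.column_stack([np.ones(N), x1])       # small model: p0 = 1
X1 = np.column_stack([np.ones(N), x1, x2])   # large model: p1 = 2
p0, p1 = 1, 2

# F = [(RSS0 - RSS1)/(p1 - p0)] / [RSS1/(N - p1 - 1)]
F = ((rss(X0, y) - rss(X1, y)) / (p1 - p0)) / (rss(X1, y) / (N - p1 - 1))
p_val = stats.f.sf(F, p1 - p0, N - p1 - 1)   # upper-tail probability
```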
A 1 − 2α confidence interval for β_j:

(β̂_j − z^{(1−α)} √v_j σ̂,  β̂_j + z^{(1−α)} √v_j σ̂)

where z^{(1−α)} is the 1 − α percentile of the normal distribution.
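A sketch computing z_j and this interval for each coefficient, on simulated data (the design, coefficients, and α below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, p = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))
v = np.diag(XtX_inv)                     # v_j: jth diagonal of (X^T X)^{-1}

z = beta_hat / (sigma_hat * np.sqrt(v))  # z_j for testing beta_j = 0

alpha = 0.025                            # 1 - 2*alpha = 95% interval
zq = stats.norm.ppf(1 - alpha)           # z^{(1-alpha)}
lower = beta_hat - zq * np.sqrt(v) * sigma_hat
upper = beta_hat + zq * np.sqrt(v) * sigma_hat
```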
Variable Selection (Secn 3.4)
Objectives: improve prediction and interpretation. Achieved by eliminating or reducing the role of lesser/redundant variables. Many different strategies/criteria.
Subset Selection: retain only a subset. Estimate coefs of retained variables with least squares as usual.
Best subset regression: for k = 1, …, p find those k variables giving the smallest RSS. Feasible for p up to roughly 30.
Forward stepwise selection: start with just the intercept; sequentially add variables (one at a time) that most improve the fit as measured by the F statistic above. Stop when no variable is significant at (say) the 10% level.
Backward stepwise elimination: start with full model; sequentially delete variables that contribute least.
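A minimal sketch of forward stepwise selection using the F statistic above — greedy, one variable at a time, stopping at a fixed significance level (10% here). The data and true model are assumed for illustration:

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_stepwise(X, y, alpha=0.10):
    """Greedy forward selection: at each step add the candidate column with
    the smallest F-test p-value; stop when the best addition is no longer
    significant at level alpha."""
    N, p = X.shape
    selected, remaining = [], list(range(p))
    current = np.ones((N, 1))               # start with intercept only
    while remaining:
        best = None
        for j in remaining:
            Xj = np.column_stack([current, X[:, j]])
            k = Xj.shape[1] - 1             # coefficients beyond the intercept
            # One added term, so p1 - p0 = 1
            F = (rss(current, y) - rss(Xj, y)) / (rss(Xj, y) / (N - k - 1))
            pval = stats.f.sf(F, 1, N - k - 1)
            if best is None or pval < best[1]:
                best = (j, pval)
        j, pval = best
        if pval > alpha:
            break
        selected.append(j)
        current = np.column_stack([current, X[:, j]])
        remaining.remove(j)
    return selected

rng = np.random.default_rng(5)
N = 100
X = rng.normal(size=(N, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=N)  # only columns 0, 3 matter
print(forward_stepwise(X, y))
```

Backward elimination is the mirror image: start from the full model and repeatedly drop the variable whose removal costs the least by the same F criterion.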