
Penalised likelihood methods for high-dimensional pattern analysis

O.V. Krasotkina

Department of Cybernetics Science, Tula State University

11/11/2012. ICPR2012 Tutorial PM-02


Outline

1. Penalised likelihood methods for regression
   Methodologies
   Bayesian counterparts
   Computations

2. Penalised likelihood methods for classification
   Penalised linear logistic regression
   Penalised linear discriminant analysis (LDA)
   Penalised support vector machines (SVM)

3. Penalised likelihood methods for clustering
   Penalised model-based clustering


Penalised likelihood methods for regression

Methodologies

Penalised likelihood methods

Classical methods

L0-penalty: AIC (1974); BIC (1978); best subset selection
L2-penalty: Ridge regression (1970)

Modern methods

L1-penalty: Lasso (1996)

Lq-penalty: Bridge regression (1993, 1998)

(L1-penalty + L2-penalty): Elastic net (2005)

‘Adaptive’ penalty: Relaxed lasso (2007); Adaptive lasso (2006, 2009); SCAD (2001)

‘Structured’ penalty: Fused lasso (2005, 2011); Group lasso (2006, 2012)

Stable penalised methods: Bootstrap + lasso (2008, 2010)


Regression

Response variable: y

Explanatory variables (features): x = (x1, . . . , xp)T , p-dimensional

Regression model: y = f(x, θ) + ε
f(x, θ): e.g. often assume a linear function f(x, θ) = β1x1 + · · · + βpxp
θ: unknown parameters, e.g. θ = (β1, . . . , βp)ᵀ
ε: random error term; if E(ε|x) = 0, then E(y|x) = f(x, θ)
Parameter estimation (learning): θ̂
Observed training sample: {(yi, xi)}, i = 1, . . . , n, a sample of n instances

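As a concrete illustration of the setup above, here is a minimal sketch (assuming Python with NumPy; the variable names are illustrative, not from the tutorial) that simulates data from a linear model y = Xθ + ε and recovers θ by least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5                          # n instances, p features
    X = rng.normal(size=(n, p))            # design matrix (x1, ..., xn)^T
    theta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
    eps = rng.normal(scale=0.3, size=n)    # random error with E(eps | x) = 0
    y = X @ theta_true + eps               # linear regression model

    # Least squares estimate of theta
    theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.round(theta_ls, 2))           # close to theta_true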

 

 

 


Parameter estimation

Given n training instances {(yi, xi)}, i = 1, . . . , n, we estimate θ by

Least squares estimation: select θ that minimises the sum of squared residuals:
θ̂_LS = argmin_θ Σ_{i=1}^n {yi − f(xi, θ)}²

Maximum likelihood estimation (MLE): select θ that maximises a likelihood function L(θ) (assumed concave here for simplicity):
L(θ) = p(y1, . . . , yn | x1, . . . , xn, θ) = Π_{i=1}^n p(yi | xi, θ)   (iid assumption)

The distribution p(y | x, θ̂_ML) has the largest probability of producing the observed sample:
θ̂_ML = argmax_θ ℓ(θ) = argmax_θ log L(θ) = argmax_θ Σ_{i=1}^n log p(yi | xi, θ)
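The link between the two estimators can be checked numerically. Below is a minimal sketch (assuming Python with NumPy and SciPy; not part of the tutorial) showing that, for a linear model with iid Gaussian errors, maximising the log-likelihood gives the same estimate as minimising the sum of squared residuals.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, p = 80, 3
    X = rng.normal(size=(n, p))
    theta_true = np.array([1.5, -2.0, 0.7])
    sigma = 0.5
    y = X @ theta_true + rng.normal(scale=sigma, size=n)

    def neg_log_likelihood(theta):
        # -sum_i log p(y_i | x_i, theta) for iid Gaussian errors with known sigma
        resid = y - X @ theta
        return 0.5 * np.sum(resid ** 2) / sigma ** 2 + 0.5 * n * np.log(2 * np.pi * sigma ** 2)

    theta_ml = minimize(neg_log_likelihood, x0=np.zeros(p)).x   # maximum likelihood
    theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)            # least squares
    print(np.allclose(theta_ml, theta_ls, atol=1e-4))           # True (up to optimiser tolerance)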

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Model selection

“Essentially, all models are wrong, but some are useful” – George Box

Aims: Given a set of candidate models, we want to select an optimal model

For different candidate models, θ is composed of different βj's; e.g. β1 = 0 and βp ≠ 0 for Model A, while β1 ≠ 0 and βp = 0 for Model B

Let us be intuitive and slightly naive: compare ℓ(θ̂_ML^(A)) and ℓ(θ̂_ML^(B)); i.e. pick the model whose maximised log-likelihood ℓ(θ̂_ML), with θ̂_ML = argmax_θ ℓ(θ), is largest

Problems with more features selected:
overfitting
worse prediction for new instances
less interpretable model

Solution: select fewer explanatory variables (drop redundant ones)
Penalise larger models with more unknown parameters


Selecting features: AIC, BIC & L0-penalty

Akaike information criterion (AIC; Akaike 1974)

 

 

 

AIC(θ̂_ML) = −ℓ(θ̂_ML) + k,   θ̂ = argmin_{θ̂_ML} AIC(θ̂_ML),
where k is the number of parameters in θ̂_ML

Bayesian information criterion (BIC; Schwarz 1978)

BIC(θ̂_ML) = −ℓ(θ̂_ML) + (log n / 2) · k,   θ̂ = argmin_{θ̂_ML} BIC(θ̂_ML)

@Akaike, H. (1974) A new look at the statistical model identification. IEEE Trans. Automatic Control 19(6):716-723.

@Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics 6(2):461-464.


Selecting features: AIC, BIC & L0-penalty

AIC(θ̂_ML) = −ℓ(θ̂_ML) + k,   θ̂ = argmin_{θ̂_ML} AIC(θ̂_ML)

BIC(θ̂_ML) = −ℓ(θ̂_ML) + (log n / 2) · k,   θ̂ = argmin_{θ̂_ML} BIC(θ̂_ML)

Here k = ‖θ̂_ML‖₀ = |β1|⁰ + · · · + |βp|⁰, given the convention 0⁰ = 0. Let us be intuitive and slightly naive again:

θ̂ = argmin_θ { −ℓ(θ) + λ ‖θ‖₀ },

where λ ≥ 0. Problems with this ‘best subset selection’?

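To make the L0 / best-subset idea concrete, here is a sketch (assuming Python with NumPy and a Gaussian linear model, so that ℓ(θ̂_ML) is the Gaussian log-likelihood at the least squares fit; a toy example, not from the tutorial) that scores every subset of features with the AIC and BIC defined above. Its cost grows as 2^p, which is exactly the feasibility problem raised on the next slide.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(2)
    n, p = 60, 6
    X = rng.normal(size=(n, p))
    theta_true = np.array([1.0, 0.0, 0.0, -1.5, 0.0, 0.0])
    y = X @ theta_true + rng.normal(scale=0.5, size=n)

    def gaussian_loglik(y, yhat):
        # Gaussian log-likelihood at the MLE, with sigma^2 profiled out as RSS / n
        rss = np.sum((y - yhat) ** 2)
        return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

    best = {"AIC": (np.inf, None), "BIC": (np.inf, None)}
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            Xs = X[:, subset]
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            ll = gaussian_loglik(y, Xs @ coef)
            scores = {"AIC": -ll + k,                    # AIC as on the slide: -l + k
                      "BIC": -ll + 0.5 * np.log(n) * k}  # BIC as on the slide: -l + (log n / 2) k
            for name, score in scores.items():
                if score < best[name][0]:
                    best[name] = (score, subset)

    print(best)   # both criteria typically recover the truly non-zero features (indices 0 and 3)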

 

 

 


‘Shrinking’ features: ridge regression & L2-penalty

Problems with θ̂ = argmin_θ { −ℓ(θ) + λ ‖θ‖₀ }:
Feasibility: large p; high-dimensional ‘large-p-small-n’ data
Variability of selected models: discrete selection

Ridge regression (Hoerl and Kennard 1970)

θ̂ = argmin_θ { −ℓ(θ) + λ ‖θ‖₂² }

L2-penalty ‖θ‖₂² = Σ_{j=1}^p βj²: strictly convex; differentiable
L2-penalty ‖θ‖₂²: continuous

@Hoerl, A.E. and Kennard, R.W. (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55-67.


‘Shrinking’ features: ridge regression & L2-penalty

 

 

 

 

 

 

 

 

 

 

 

 

Ridge regression: θ̂ = argmin_θ { −ℓ(θ) + λ ‖θ‖₂² }
Shrink β̂j towards zero

For a linear regression model yi = xiᵀθ + εi, ridge regression becomes
θ̂ = argmin_θ { ‖y − Xθ‖₂² + λ ‖θ‖₂² },
where y = (y1, . . . , yn)ᵀ and X is the design matrix (x1, . . . , xn)ᵀ

Closed-form solution:
λ = 0: θ̂_LS = (XᵀX)⁻¹Xᵀy   (least squares)
λ > 0: θ̂ = (XᵀX + λI)⁻¹Xᵀy

Further assume an orthonormal design XᵀX = I:
θ̂ = θ̂_LS / (1 + λ)

The tuning parameter λ: the larger λ, the greater the shrinkage

Problems with ridge regression?

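A minimal sketch of the closed-form ridge solution above (assuming Python with NumPy; a toy example, not from the tutorial). It also checks the orthonormal-design special case θ̂ = θ̂_LS / (1 + λ).

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 50, 4
    # Orthonormal design: the columns of X satisfy X^T X = I
    X, _ = np.linalg.qr(rng.normal(size=(n, p)))
    theta_true = np.array([3.0, -2.0, 0.5, 0.0])
    y = X @ theta_true + rng.normal(scale=0.2, size=n)

    def ridge(X, y, lam):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    theta_ls = ridge(X, y, 0.0)                       # lambda = 0: ordinary least squares
    for lam in (0.1, 1.0, 10.0):
        theta_ridge = ridge(X, y, lam)
        # With X^T X = I the ridge estimate is a rescaled least squares estimate
        assert np.allclose(theta_ridge, theta_ls / (1.0 + lam))
        print(lam, np.round(theta_ridge, 3))          # larger lambda, greater shrinkage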

 

 

 


Selecting and ‘shrinking’ features: lasso & L1-penalty

Lasso: Least absolute shrinkage and selection operator (Tibshirani 1996)

 

 

 

 

θ̂ = argmin_θ { −ℓ(θ) + λ ‖θ‖₁ }

L1-penalty ‖θ‖₁ = Σ_{j=1}^p |βj|: convex
L1-penalty ‖θ‖₁: continuous
Shrink β̂j towards zero
The tuning parameter λ: the larger λ, the greater the shrinkage

Unlike ridge regression, the lasso can truncate sufficiently small β̂j exactly at zero; that is, the lasso can do model selection

Compare the two penalties:
L2-penalty ‖θ‖₂² = Σ_{j=1}^p βj²: strictly convex; differentiable
L1-penalty ‖θ‖₁ = Σ_{j=1}^p |βj|: convex; singular at βj = 0

@Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1):267-288.

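A short sketch of the lasso in practice (assuming Python with scikit-learn; note that the scaling of its regularisation parameter alpha differs from the λ above, so the numbers are purely illustrative). The point is that some coefficients come out exactly zero, i.e. features are selected, whereas ridge regression only shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(4)
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    theta_true = np.zeros(p)
    theta_true[:3] = [2.0, -1.0, 0.5]                 # only the first 3 features are relevant
    y = X @ theta_true + rng.normal(scale=0.5, size=n)

    lasso = Lasso(alpha=0.1).fit(X, y)                # L1 penalty
    ridge = Ridge(alpha=0.1).fit(X, y)                # L2 penalty
    print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # several exact zeros
    print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none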


Difference between using L2, L1 & L0 penalties

Linear regression: yi = xiᵀθ + εi
Orthonormal design matrix: XᵀX = I
Horizontal axis: θ̂_LS, the least squares estimator (or MLE under certain Gaussian assumptions)
Vertical axis: θ̂, the estimator obtained by ridge regression (left-hand), lasso (middle) & subset selection (right-hand), indicated by solid lines

[Figure: three panels (ridge, lasso, subset selection) plotting the penalised estimate β̂* against the least squares estimate β̂_LS]

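Under the orthonormal design assumption the three solid curves in the figure can be written down explicitly. Here is a sketch (assuming Python with NumPy and one common scaling convention for the penalised least squares criterion; the exact threshold values depend on how λ multiplies the penalty) of the three coefficient maps: proportional shrinkage for ridge, soft thresholding for the lasso, and hard thresholding for subset selection.

    import numpy as np

    def ridge_map(b_ls, lam):
        # Ridge with X^T X = I: proportional shrinkage towards zero
        return b_ls / (1.0 + lam)

    def lasso_map(b_ls, lam):
        # Lasso with X^T X = I: soft thresholding (exactly zero for |b_ls| <= lam)
        return np.sign(b_ls) * np.maximum(np.abs(b_ls) - lam, 0.0)

    def subset_map(b_ls, lam):
        # L0 penalty with X^T X = I: hard thresholding (keep or kill);
        # the cut-off is shown at lam for illustration and depends on the penalty scaling
        return np.where(np.abs(b_ls) > lam, b_ls, 0.0)

    b_ls = np.linspace(-5, 5, 11)
    lam = 2.0
    for name, f in (("ridge", ridge_map), ("lasso", lasso_map), ("subset", subset_map)):
        print(name, np.round(f(b_ls, lam), 2))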

 

 

 


Why can the lasso shrink a small β̂j exactly to zero while ridge regression cannot?

[Figure: the ridge penalty λβj² and the lasso penalty λ|βj| plotted as functions of βj]

An intuitive hint: for small βj, the lasso penalty (λ|βj|) is larger than the ridge penalty (λβj²)

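A numeric version of this hint (plain Python; illustrative only): near zero the lasso penalty λ|βj| dominates the ridge penalty λβj², so there is still a non-negligible price for keeping a coefficient slightly away from zero.

    lam = 1.0
    for b in (1.0, 0.5, 0.1, 0.01):
        print(b, "lasso:", lam * abs(b), "ridge:", lam * b ** 2)
    # For |b| < 1 the lasso penalty exceeds the ridge penalty,
    # and the gap grows (in relative terms) as b approaches zero.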

 

 

 


 

 

 

 

 

Why can the lasso shrink a small β̂j exactly to zero while ridge regression cannot?

Alternative (constrained optimisation) expressions

Ridge regression

 

 

θ̂ = argmin_θ { −ℓ(θ) },   s.t. ‖θ‖₂² ≤ t

Lasso

 

θ̂ = argmin_θ { −ℓ(θ) },   s.t. ‖θ‖₁ ≤ t

[Figure: the constraint regions in the (β1, β2) plane: the disc ‖β‖₂² ≤ t (q = 2) for ridge regression and the diamond ‖β‖₁ ≤ t (q = 1) for the lasso]


 

 

 


Why can the lasso shrink a small β̂j exactly to zero while ridge regression cannot?

Linear regression: under certain Gaussian assumptions (ie least squares estimation)

 

 

 

 

argmin_θ { −ℓ(θ) } = argmin_θ Σ_{i=1}^n (yi − xiᵀθ)² = argmin_θ (1/n) Σ_{i=1}^n (yi − xiᵀθ)²

[Figure: elliptical contours of the residual sum of squares in the (β1, β2) plane, centred at the unpenalised estimate θ̂]


 

 

 


 

 

 

 

 

Why can the lasso shrink a small β̂j exactly to zero while ridge regression cannot?

Geometric illustration:

 

 

 

 

[Figure: RSS contours centred at θ̂ in the (β1, β2) plane, shown together with the constraint region: the disc ‖β‖₂² ≤ t for ridge regression and the diamond ‖β‖₁ ≤ t for the lasso. The lasso region has corners on the coordinate axes, so the constrained solution can sit at a corner where some βj is exactly zero; the smooth ridge disc has no such corners.]


 

 

 


Lq-penalty: ‖β‖_q^q = |β1|^q + |β2|^q

[Figure: the contour |β1|^q + |β2|^q = 1 in the (β1, β2) plane for several values of q (panels labelled q = 4, 3, 2, 1)]


 

 

 


 

 

 

 

 

A generalisation: bridge regression & Lq-penalty

Bridge regression (Frank and Friedman 1993; Fu 1998)

 

 

θ̂ = argmin_θ { −ℓ(θ) + λ ‖θ‖_q^q }

Lq-penalty ‖θ‖_q^q = Σ_{j=1}^p |βj|^q for q ≥ 0

@ Frank, I.E. and Friedman, J.H. (1993) A statistical view of some chemometrics regression tools (with discussion). Technometrics 35(2):109-148.

@ Fu, W.J. (1998) Penalized regression: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7(3):397-416.

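A small sketch (assuming Python with NumPy; illustrative only) of the Lq penalty family behind bridge regression. Evaluating ‖θ‖_q^q for several q shows how q = 2 and q = 1 recover the ridge and lasso penalties, while q → 0 counts the non-zero coefficients (with the convention 0⁰ = 0 used earlier).

    import numpy as np

    def lq_penalty(beta, q):
        beta = np.asarray(beta, dtype=float)
        if q == 0:
            # L0 'norm': number of non-zero coefficients (convention 0^0 = 0)
            return float(np.count_nonzero(beta))
        return float(np.sum(np.abs(beta) ** q))

    beta = [2.0, -0.5, 0.0, 0.1]
    for q in (2, 1, 0.5, 0):
        print(q, lq_penalty(beta, q))
    # A bridge-regression objective would then be -loglik(theta) + lam * lq_penalty(theta, q),
    # which is convex in theta for q >= 1 and non-convex for q < 1.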

 

 

 


Bridge regression

 

 

 

[Figure: Lq constraint regions |β1|^q + |β2|^q ≤ 1 in the (β1, β2) plane for q = 4, 3, 2, 1 and q < 1]

          example    select features   shrink larger β̂j                 convexity
q > 1     ridge      no                larger                            strictly convex
q = 1     lasso      yes               constant (orthogonal design)      convex
q < 1     (none)     yes               smaller (exactly zero possible)   non-convex

[Figure: the resulting coefficient maps β̂* versus β̂_LS for ridge, lasso and subset selection, as in the earlier three-panel figure]


Selecting correlated features: elastic net & (L1-penalty + L2-penalty)

A ‘drawback’ of the lasso: it cannot select all of a group of highly correlated features; the L1 penalty is convex but not strictly convex

Elastic net (Zou and Hastie 2005)

θ̂ = argmin_θ { −ℓ(θ) + λ1 ‖θ‖₁ + λ2 ‖θ‖₂² }

A generalisation of lasso and ridge regression

L1 penalty for feature selection

L2 penalty for selection of highly-correlated features

Side effect: double shrinkage (introducing extra bias)
(Corrected) elastic net: θ̂_corrected = (1 + λ2) θ̂
Intuition: cancel the shrinkage (by the factor 1/(1 + λ2)) executed by ridge regression

@Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67(2):301-320.

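A sketch of the elastic net on correlated features (assuming Python with scikit-learn; note that its parameterisation uses alpha and l1_ratio rather than λ1 and λ2, and it does not apply the (1 + λ2) correction above). With two nearly identical features, the lasso tends to keep only one of them, while the elastic net tends to keep both.

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    rng = np.random.default_rng(5)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)            # x2 is almost a copy of x1
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    y = x1 + x2 + rng.normal(scale=0.5, size=n)    # both correlated features matter

    lasso = Lasso(alpha=0.5).fit(X, y)
    enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # mixes L1 and L2 penalties
    print("lasso:", np.round(lasso.coef_, 3))      # often keeps only one of x1, x2
    print("enet: ", np.round(enet.coef_, 3))       # weight spread over both x1 and x2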