Penalised likelihood methods for high-dimensional pattern analysis
O.V. Krasotkina
Department of Cybernetics Science, Tula State University
11/11/2012. ICPR2012 Tutorial PM-02
Outline
1. Penalised likelihood methods for regression
   Methodologies
   Bayesian counterparts
   Computations
2. Penalised likelihood methods for classification
   Penalised linear logistic regression
   Penalised linear discriminant analysis (LDA)
   Penalised support vector machines (SVM)
3. Penalised likelihood methods for clustering
   Penalised model-based clustering
Penalised likelihood methods
Classical methods
L0-penalty: AIC (1974); BIC (1978); best subset selection
L2-penalty: Ridge regression (1970)
Modern methods
L1-penalty: Lasso (1996)
Lq-penalty: Bridge regression (1993, 1998)
(L1-penalty + L2-penalty): Elastic net (2005)
‘Adaptive’ penalty: Relaxed lasso (2007); Adaptive lasso (2006, 2009); SCAD (2001)
‘Structured’ penalty: Fused lasso (2005, 2011); Group lasso (2006, 2012)
Stable penalised methods: Bootstrap + lasso (2008, 2010)
Regression
Response variable: y
Explanatory variables (features): x = (x1, . . . , xp)ᵀ, p-dimensional
Regression model: y = f(x, θ) + ε
f(x, θ): eg often assume a linear function f(x, θ) = β1x1 + · · · + βpxp
θ: unknown parameters, eg θ = (β1, . . . , βp)ᵀ
ε: random error term; if E(ε | x) = 0, then E(y | x) = f(x, θ)
Parameter estimation (learning): θ̂
Observed training sample: {(yi, xi)}_{i=1}^n, a sample of n instances
Parameter estimation
Given n training instances {(yi, xi)}_{i=1}^n, we estimate θ by

Least squares estimation: select θ that minimises the sum of squared residuals:

θ̂_LS = argmin_θ Σ_{i=1}^n {yi − f(xi, θ)}²

Maximum likelihood estimation (MLE): select θ that maximises a likelihood function L(θ) (assumed concave here for simplicity); with iid instances,

L(θ) = p(y1, . . . , yn | x1, . . . , xn, θ) = Π_{i=1}^n p(yi | xi, θ)

The distribution p(y | x, θ̂_ML) has the largest probability of producing the observed sample:

θ̂_ML = argmax_θ ℓ(θ) = argmax_θ log L(θ) = argmax_θ Σ_{i=1}^n log p(yi | xi, θ)
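A minimal numerical sketch of the two estimators above (simulated data; all names hypothetical): for a linear model with iid Gaussian errors, the MLE of θ coincides with the least squares estimate, so one call covers both.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                # rows are the feature vectors x_i
theta_true = np.array([2.0, 0.0, -1.0])
y = X @ theta_true + rng.normal(size=n)    # y = f(x, theta) + eps, linear f

# Least squares: minimise sum_i (y_i - x_i^T theta)^2.
# Under iid Gaussian errors this is also the maximum likelihood estimate.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                           # close to theta_true
```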
Model selection
“Essentially, all models are wrong, but some are useful” – George Box
Aims: Given a set of candidate models, we want to select an optimal model
For different candidate models, θ is composed of different βj's; eg β1 = 0 and βp ≠ 0 for Model A, while β1 = 0 and βp = 0 for Model B
Let us be intuitive and slightly naive: compare ℓ(θ̂_ML^(A)) and ℓ(θ̂_ML^(B)); ie

θ̂ = argmax_{θ̂_ML} ℓ(θ̂_ML)
Problems with more features selected:
overfitting
worse prediction for new instances
less interpretable model
Solution: select fewer explanatory variables, dropping redundant ones
Penalise larger models with more unknown parameters
Selecting features: AIC, BIC & L0-penalty
Akaike information criterion (AIC; Akaike 1974)

AIC(θ̂_ML) = −ℓ(θ̂_ML) + k,  θ̂ = argmin_{θ̂_ML} AIC(θ̂_ML),

where k is the number of parameters in θ̂_ML

Bayesian information criterion (BIC; Schwarz 1978)

BIC(θ̂_ML) = −ℓ(θ̂_ML) + (log n / 2) k,  θ̂ = argmin_{θ̂_ML} BIC(θ̂_ML)
@Akaike, H. (1974) A new look at the statistical model identification. IEEE Trans. Automatic Control 19(6):716-723.
@Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics 6(2):461-464.
Selecting features: AIC, BIC & L0-penalty
AIC(θ̂_ML) = −ℓ(θ̂_ML) + k,  θ̂ = argmin_{θ̂_ML} AIC(θ̂_ML)

BIC(θ̂_ML) = −ℓ(θ̂_ML) + (log n / 2) k,  θ̂ = argmin_{θ̂_ML} BIC(θ̂_ML)

k = ‖θ̂_ML‖₀ = |β̂1|⁰ + · · · + |β̂p|⁰, given 0⁰ = 0. Let us be intuitive and slightly naive again:

θ̂ = argmin_θ {−ℓ(θ) + λ‖θ‖₀},
where λ ≥ 0. Problems with this ‘best subset selection’?
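A brute-force sketch of best subset selection scored by the criteria above (a Gaussian linear model is assumed; all names are hypothetical). It enumerates every non-empty subset, which already hints at the feasibility problem discussed next.

```python
import numpy as np
from itertools import combinations

def gaussian_loglik(y, yhat):
    # Profile log-likelihood of a linear-Gaussian model (sigma^2 profiled out)
    n = len(y)
    rss = np.sum((y - yhat) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

def best_subset(X, y, criterion="AIC"):
    # Score every non-empty subset S by the slide's convention:
    # -l(theta_ML) + k for AIC, -l(theta_ML) + (log n / 2) k for BIC
    n, p = X.shape
    best_score, best_S = np.inf, None
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            Xs = X[:, S]
            theta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            penalty = k if criterion == "AIC" else 0.5 * np.log(n) * k
            score = -gaussian_loglik(y, Xs @ theta) + penalty
            if score < best_score:
                best_score, best_S = score, S
    return best_S

# 2^p - 1 subsets in total: exhaustive search is feasible only for small p.
```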
‘Shrinking’ features: ridge regression & L2-penalty
Problems with θ̂ = argmin_θ {−ℓ(θ) + λ‖θ‖₀}:
Feasibility: large p; high-dimensional ‘large-p-small-n’ data
Variability of selected models: discrete selection
Ridge regression (Hoerl and Kennard 1970)
θ̂ = argmin_θ {−ℓ(θ) + λ‖θ‖₂²}

L2-penalty ‖θ‖₂² = β1² + · · · + βp²: strictly convex; differentiable
L2-penalty ‖θ‖₂²: continuous
@Hoerl, A.E. and Kennard, R.W. (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55-67.
‘Shrinking’ features: ridge regression & L2-penalty
Ridge regression: θ̂ = argmin_θ {−ℓ(θ) + λ‖θ‖₂²}
Shrink β̂j towards zero
For a linear regression model yi = xiᵀθ + εi, ridge regression becomes
θ̂ = argmin_θ {‖y − Xθ‖₂² + λ‖θ‖₂²},

where y = (y1, . . . , yn)ᵀ and X is the design matrix (x1, . . . , xn)ᵀ
Closed-form solution:
λ = 0: θ̂_LS = (XᵀX)⁻¹Xᵀy (least squares)
λ > 0: θ̂ = (XᵀX + λI)⁻¹Xᵀy
Further assume an orthonormal design XᵀX = I:

θ̂ = (1 / (1 + λ)) θ̂_LS
The tuning parameter λ: the larger λ, the greater shrinkage
Problems with ridge regression?
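A minimal numpy sketch of the closed-form solution above (simulated data; names hypothetical):

```python
import numpy as np

def ridge(X, y, lam):
    # Closed form: theta_hat = (X^T X + lambda I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(size=50)

print(ridge(X, y, 0.0))    # lambda = 0 recovers least squares
print(ridge(X, y, 10.0))   # larger lambda: all coefficients shrunk, none exactly zero
```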
Selecting and ‘shrinking’ features: lasso & L1-penalty
Lasso: Least absolute shrinkage and selection operator (Tibshirani 1996)
θ̂ = argmin_θ {−ℓ(θ) + λ‖θ‖₁}

L1-penalty ‖θ‖₁ = |β1| + · · · + |βp|: convex
L1-penalty ‖θ‖₁: continuous
Shrink β̂j towards zero
The tuning parameter λ: the larger λ, the greater shrinkage
Unlike ridge regression, the lasso can truncate β̂j exactly at zero; that is, the lasso can do model selection

L2-penalty ‖θ‖₂² = β1² + · · · + βp²: strictly convex; differentiable
L1-penalty ‖θ‖₁ = |β1| + · · · + |βp|: convex; singular at βj = 0
@Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1):267-288.
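A quick sketch of the selection property with scikit-learn (hypothetical simulated data; note that sklearn scales the squared-error term by 1/(2n), so its alpha is a rescaled λ):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + rng.normal(size=100)   # only the first feature is relevant

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ == 0.0))   # several coefficients truncated exactly to zero
print(np.sum(ridge.coef_ == 0.0))   # typically zero: ridge shrinks but does not select
```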
Difference between using L2, L1 & L0 penalties
Linear regression: yi = xiᵀθ + εi
Orthonormal design matrix: XᵀX = I
Horizontal axis: θ̂_LS, the least squares estimator (or MLE under certain Gaussian assumptions)
Vertical axis: θ̂, the estimator obtained by ridge regression (left), lasso (middle) & subset selection (right), indicated by solid lines
[Figure: β̂* against β̂_LS for ridge (left), lasso (middle) and subset selection (right)]
Why can the lasso shrink small β̂j exactly to zero while ridge regression cannot?
[Figure: the ridge penalty λβj² and the lasso penalty λ|βj| plotted against βj]
An intuitive hint: for small βj, the lasso penalty λ|βj| is larger than the ridge penalty λβj²
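The hint can be made concrete: under an orthonormal design (and up to the scaling convention of the squared-error term), ridge rescales β̂_LS while the lasso soft-thresholds it. A small sketch:

```python
import numpy as np

def ridge_shrink(beta_ls, lam):
    # Ridge with orthonormal design: proportional shrinkage
    return beta_ls / (1.0 + lam)

def lasso_shrink(beta_ls, lam):
    # Lasso with orthonormal design: soft-thresholding;
    # any |beta_ls| <= lam is truncated exactly to zero
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([-2.0, -0.3, 0.1, 1.5])
print(ridge_shrink(beta_ls, 0.5))   # every entry shrunk, none zero
print(lasso_shrink(beta_ls, 0.5))   # [-1.5, -0.0, 0.0, 1.0]: exact zeros
```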
Why can the lasso shrink small β̂j exactly to zero while ridge regression cannot?
Alternative (constrained optimisation) expressions
Ridge regression
θ̂ = argmin_θ {−ℓ(θ)}, s.t. ‖θ‖₂² ≤ t

Lasso

θ̂ = argmin_θ {−ℓ(θ)}, s.t. ‖θ‖₁ ≤ t
[Figure: constraint regions in the (β1, β2) plane: the disc ‖θ‖₂² ≤ t (q = 2) and the diamond ‖θ‖₁ ≤ t (q = 1)]
Why can the lasso shrink small β̂j exactly to zero while ridge regression cannot?
Linear regression: under certain Gaussian assumptions (ie least squares estimation)
argmin_θ {−ℓ(θ)} = argmin_θ Σ_{i=1}^n (yi − xiᵀθ)²

[Figure: elliptical contours of the residual sum of squares in the (β1, β2) plane, centred at θ̂]
Why can the lasso shrink small β̂j exactly to zero while ridge regression cannot?
Geometric illustration:
[Figure: RSS contours centred at θ̂ meeting the ridge constraint disc (left) and the lasso constraint diamond (right) in the (β1, β2) plane]
Lq-penalty: ‖θ‖_q^q = |β1|^q + |β2|^q
[Figure: contours of the Lq penalty |β1|^q + |β2|^q in the (β1, β2) plane for q = 4, 3, 2 and 1]
A generalisation: bridge regression & Lq-penalty
Bridge regression (Frank and Friedman 1993; Fu 1998)
θ̂ = argmin_θ {−ℓ(θ) + λ‖θ‖_q^q}

Lq-penalty ‖θ‖_q^q = |β1|^q + · · · + |βp|^q for q ≥ 0
@Frank, I.E. and Friedman, J.H. (1993) A statistical view of some chemometrics regression tools (with discussion). Technometrics 35(2):109-148.
@Fu, W.J. (1998) Penalized regression: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7(3):397-416.
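A generic-optimiser sketch of the bridge objective (simulated data; names hypothetical). For q > 1 the problem is smooth and convex; for q < 1 it is non-convex, so a local optimiser gives no guarantees:

```python
import numpy as np
from scipy.optimize import minimize

def bridge_objective(theta, X, y, lam, q):
    # Penalised least squares with an Lq penalty: RSS + lambda * sum_j |beta_j|^q
    rss = np.sum((y - X @ theta) ** 2)
    return rss + lam * np.sum(np.abs(theta) ** q)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, 0.0, -2.0]) + rng.normal(size=50)

# q = 2 is ridge, q = 1 is lasso; here q = 1.5 interpolates between them
res = minimize(bridge_objective, x0=np.zeros(4), args=(X, y, 1.0, 1.5))
print(res.x)
```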
Bridge regression
[Figure: Lq-penalty contours in the (β1, β2) plane for several values of q, from q = 4 down to q = 0]
q     | example | select features? | shrink larger β̂j            | convexity
q > 1 | ridge   | no               | larger                       | strictly convex
q = 1 | lasso   | yes              | constant (orthogonal design) | convex
q < 1 | —       | yes              | smaller (zero perfectly)     | non-convex
[Figure: β̂* against β̂_LS for ridge (left), lasso (middle) and subset selection (right)]
Selecting correlated features: elastic net & (L1-penalty + L2-penalty)
A ‘drawback’ of the lasso: it cannot select all of a group of highly correlated features; the L1 penalty is convex but not strictly convex
Elastic net (Zou and Hastie 2005)

θ̂ = argmin_θ {−ℓ(θ) + λ1‖θ‖₁ + λ2‖θ‖₂²}
A generalisation of lasso and ridge regression
L1 penalty for feature selection
L2 penalty for selection of highly-correlated features
Side effect: double shrinkage (introducing extra bias)
(Corrected) elastic net: θ̂(corrected) = θ̂(naive) (1 + λ2)
Intuition: cancel the shrinkage 1 / (1 + λ2) executed by ridge regression
@Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67(2):301-320.
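A sketch of the grouping effect with scikit-learn (hypothetical simulated data with two nearly identical features; note that sklearn parametrises the penalty as alpha·(l1_ratio·‖θ‖₁ + ½(1 − l1_ratio)‖θ‖₂²) rather than with separate λ1, λ2):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
z = rng.normal(size=(100, 1))
X = np.hstack([z, z + 0.01 * rng.normal(size=(100, 1)),  # two highly correlated features
               rng.normal(size=(100, 8))])
y = X[:, 0] + X[:, 1] + rng.normal(size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(enet.coef_[:2])    # the L2 term encourages keeping both correlated features
print(lasso.coef_[:2])   # the lasso may keep one and drop the other
```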