Chapter 5 Shrinkage Methods
5.1 Ridge and Lasso Regression
Performing Ridge Regression
To perform ridge regression, we use the glmnet package.
library(glmnet)
Important Note: glmnet() uses different syntax than regsubsets().
Continuing the example with the Hitters data,

y = Hitters$Salary
# Here we exclude the first column because it corresponds to the intercept.
x = model.matrix(Salary ~ ., Hitters)[,-1]

Note that model.matrix(Salary ~ ., Hitters)[,-1] carries the same information as Hitters[,-19] (Salary has index 19), i.e. all the data for the predictor variables, except that factor variables are converted to dummy variables.
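As a quick check (this assumes Hitters has already been cleaned of missing values, as in earlier chapters), the design matrix can be inspected directly:

dim(x)       # 19 predictor columns (263 rows if the incomplete rows were removed)
colnames(x)  # factors League, Division and NewLeague appear as dummies LeagueN, DivisionW, NewLeagueN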
Performing ridge regression just involves using the glmnet() function, specifying alpha = 0 (for ridge).

ridge = glmnet(x, y, alpha = 0)
To use a different number of \(\lambda\) values, change the parameter nlambda, which is 100 by default.
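For example, a finer grid of 200 values could be requested as below (the object name ridge200 is just illustrative):

ridge200 = glmnet(x, y, alpha = 0, nlambda = 200)
length(ridge200$lambda)  # size of the lambda grid actually used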
Extracting Information
names(ridge)
## [1] "a0" "beta" "df" "dim" "lambda" "dev.ratio"
## [7] "nulldev" "npasses" "jerr" "offset" "call" "nobs"
- ridge$beta contains the values of the coefficients under each \(\lambda\) (use ridge$beta[,1:3] to see the first 3, for example).
- coef(ridge) contains exactly the same information as ridge$beta but also includes the intercept estimates.
- ridge$lambda contains the grid of all \(\lambda\) values that were used.
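For example, using the ridge fit from above, the start of the path can be inspected like this (nothing new here, just the objects listed above):

ridge$beta[, 1:3]   # coefficients at the first three lambda values
coef(ridge)[, 1:3]  # the same, with the intercept row added
ridge$lambda[1:3]   # the corresponding lambda values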
5.1.1 Performing Lasso Regression
Very similar to ridge but with alpha = 1 in glmnet(), which is the default and so does not need to be specified.
y = Hitters$Salary
x = model.matrix(Salary ~ ., Hitters)[,-1]
lasso = glmnet(x, y)
Plotting Regularisation Paths
xvar = 'lambda' specifies plotting the coefficient values against Log \(\lambda\); otherwise the default xvar is the \(L_1\) Norm. Below are two examples; the method is exactly the same for ridge and lasso.
Ridge Coefficients Against Log \(\lambda\)
plot(ridge, xvar = 'lambda')
Lasso Coefficients Against \(L_1\) Norm
plot(lasso)
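If it is hard to tell the curves apart, plot.glmnet() also has a label argument which tags each path with its variable number, e.g.

plot(lasso, xvar = 'lambda', label = TRUE)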
Cross-Validation
To find the best \(\lambda\) with cross-validation, use cv.glmnet() instead.
ridge.cv = cv.glmnet(x, y, alpha = 0)
names(ridge.cv)
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "call" "name" "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
- ridge.cv$lambda.min gives the \(\lambda\) with the smallest cross-validation error.
- ridge.cv$lambda.1se gives the largest \(\lambda\) whose cross-validation error is within one standard error of the minimum.
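As a small illustration of how these are typically used, the coefficients at the selected \(\lambda\) can be extracted directly from the cv.glmnet object:

ridge.cv$lambda.min               # lambda with the smallest CV error
ridge.cv$lambda.1se               # largest lambda within 1 SE of the minimum
coef(ridge.cv, s = 'lambda.min')  # coefficients at lambda.min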
Plotting
In the plot below, the left dotted line highlights the value of lambda.min and the right dotted line highlights the value of lambda.1se.
plot(ridge.cv)
abline( h = ridge.cv$cvup[ridge.cv$index[1]], lty = 4 )
To add these to the plots of coefficients against Log \(\lambda\),
plot(ridge, xvar = 'lambda')
abline(v = log(ridge.cv$lambda.min), lty = 3) # careful to use the log here and below
abline(v = log(ridge.cv$lambda.1se), lty = 3)
5.1.2 Comparing Predictive Performance For Different \(\lambda\)s
See another example of this method applied to BSS, Ridge, Lasso and PCR at the end of this chapter.
repetitions = 50
mse.1 = c()
mse.2 = c()

set.seed(1)
for(i in 1:repetitions){
  # Step (i) random data splitting
  training.obs = sample(1:263, 175)
  y.train = Hitters$Salary[training.obs]
  x.train = model.matrix(Salary ~ ., Hitters[training.obs, ])[,-1]
  y.test = Hitters$Salary[-training.obs]
  x.test = model.matrix(Salary ~ ., Hitters[-training.obs, ])[,-1]

  # Step (ii) training phase
  lasso.train = cv.glmnet(x.train, y.train)

  # Step (iii) generating predictions
  predict.1 = predict(lasso.train, x.test, s = 'lambda.min')
  predict.2 = predict(lasso.train, x.test, s = 'lambda.1se')

  # Step (iv) evaluating predictive performance
  mse.1[i] = mean((y.test - predict.1)^2)
  mse.2[i] = mean((y.test - predict.2)^2)
}
boxplot(mse.1, mse.2, names = c('min-CV lasso','1-se lasso'),
ylab = 'Test MSE', col = 7)
5.2 Principal Component Analysis
This section contains many examples from practical 3. We use the seatpos data (38 observations of 9 variables, no missing values) from the faraway package.
library(faraway)
To perform principal component analysis without cross-validation, use the prcomp() function. Separate out the response and predictor variables into y and x,

y <- seatpos$hipcenter
x <- model.matrix(hipcenter ~ ., seatpos)[,-1]

Note that model.matrix(hipcenter ~ ., seatpos)[,-1] is equivalent to seatpos[,-9].

seatpos.pr <- prcomp(x, scale = TRUE)
Scaling
- scale = TRUE scales each variable by its standard deviation, which is needed for meaningful inference. However, this can also be done manually.
- To manually scale before running prcomp(), divide by the column standard deviations,

s <- apply(x, 2, sd) # calculates the column SDs
x.s <- sweep(x, 2, s, "/") # divides all columns by their SDs
Extracting Data
- Variance: seatpos.pr$sdev gives the standard deviation of each component (seatpos.pr$sdev^2 for the variance).
- Eigenvectors: seatpos.pr$rotation gives the eigenvectors of each component.
These two are equivalent to calling ...$values and ...$vectors on eigen(var(x.s)), where x.s is the manually scaled predictor data from above (alternatively, compare eigen(var(x)) against prcomp(x) called without scaling).
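A quick way to verify this equivalence, using the manually scaled predictor matrix x.s defined above:

e <- eigen(var(x.s))
e$values   # matches seatpos.pr$sdev^2
e$vectors  # matches seatpos.pr$rotation, up to the sign of each column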
Scree Plots and Proportion of Variance Explained
To manually calculate the proportion of variance each PC explains use,
seatpos.pr$sdev^2 / sum(seatpos.pr$sdev^2)
## [1] 0.7091482088 0.1545982512 0.0579679303 0.0301197935 0.0242774455
## [6] 0.0173967750 0.0062930659 0.0001985298
Or it can be read directly off the summary (which includes cumulative proportions),
summary(seatpos.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3818 1.1121 0.68099 0.49088 0.44070 0.3731 0.22438
## Proportion of Variance 0.7091 0.1546 0.05797 0.03012 0.02428 0.0174 0.00629
## Cumulative Proportion 0.7091 0.8638 0.92171 0.95183 0.97611 0.9935 0.99980
## PC8
## Standard deviation 0.03985
## Proportion of Variance 0.00020
## Cumulative Proportion 1.00000
This is the information a scree plot shows, which can be plotted straight from the prcomp() output,
plot(seatpos.pr)
Data Compression (Projection)
Scale data beforehand using manual scaling (see above),
x.s <- sweep(x, 2, apply(x, 2, sd), "/") # scaled data
seatpos.pr <- prcomp(x.s)
Compress the data to \(k\) PCs (here \(k = 4\)), calculate the column means of the scaled data, and then reconstruct the data (with error),
T <- t(seatpos.pr$x[, c(1,2,3,4)]) # compressed using 4 PCs
ms <- colMeans(x.s) # calculates means of the scaled data set
R <- t(ms + seatpos.pr$rotation[, c(1,2,3,4)] %*% T) # reconstruction
plot(rbind(x.s[,1:2], R[,1:2]), col=c(rep(4,38),rep(2,38)))
Original data is blue, reconstructions are red
Note: the above only plots two variables; we can plot all pairs using
pairs(rbind(x.s, R), col=c(rep(4,38),rep(2,38)))
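To get a sense of how much is lost by keeping only 4 PCs, one simple (purely illustrative) check is the average squared reconstruction error:

mean((x.s - R)^2)  # average squared difference between the scaled data and its reconstruction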
Principal Component Regression
To perform principal component regression we use the pls package.
library(pls)
PCR is done with the pcr() function. We continue with the Hitters data,

pcr.fit = pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 352.5 351.6 352.3 350.7 346.1 345.5
## adjCV 452 352.1 351.2 351.8 350.1 345.5 344.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 345.4 348.5 350.4 353.2 354.5 357.5 360.3
## adjCV 344.5 347.5 349.3 351.8 353.0 355.8 358.5
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 352.4 354.3 345.6 346.7 346.6 349.4
## adjCV 350.2 352.3 343.6 344.5 344.3 346.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 96.28 97.26 97.98 98.65 99.15 99.47 99.75
## Salary 46.86 47.76 47.82 47.85 48.10 50.40 50.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.89 99.97 99.99 100.00
## Salary 53.01 53.85 54.61 54.61
Optimal Number of PCs
This is not built in, so it can be read off from the summary above or computed with the code below. The - 1 is needed because the first entry of MSEP(pcr.fit)$val corresponds to the intercept-only model (0 components).

min.pcr = which.min(MSEP(pcr.fit)$val[1,1, ]) - 1
min.pcr
## 7 comps
## 7
MSE Validation Plot
validationplot( pcr.fit, val.type = 'MSEP' )
Coefficients of Optimal Number of PCs
coef(pcr.fit, ncomp = min.pcr)
## , , 7 comps
##
## Salary
## AtBat 27.005477
## Hits 28.531195
## HmRun 4.031036
## Runs 29.464202
## RBI 18.974255
## Walks 47.658639
## Years 24.125975
## CAtBat 30.831690
## CHits 32.111585
## CHmRun 21.811584
## CRuns 34.054133
## CRBI 28.901388
## CWalks 37.990794
## LeagueN 9.021954
## DivisionW -66.069150
## PutOuts 74.483241
## Assists -3.654576
## Errors -6.004836
## NewLeagueN 11.401041
Similarly, familiar functions like predict() can also be used on pcr.fit and accept an ncomp parameter.
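For instance, a sketch of generating predictions from the fitted PCR model with the chosen number of components (the first three rows of Hitters are used here purely as example 'new' data, assuming the data were cleaned as before):

predict(pcr.fit, Hitters[1:3, ], ncomp = min.pcr)  # predicted Salary for the first three players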
Regularisation Path Plots
A full explanation of the code is at the end of section 7.6.
coef.mat = matrix(NA, 19, 19)
for(i in 1:19){
  coef.mat[,i] = pcr.fit$coefficients[,,i]
}
plot(coef.mat[1,], type = 'l', ylab = 'Coefficients',
xlab = 'Number of components', ylim = c(min(coef.mat), max(coef.mat)))
for(i in 2:19){
lines(coef.mat[i,], col = i)
}
abline(v = min.pcr, lty = 3)
Scree Plots with PCR (manually)
PVE <- rep(NA, 19)
for(i in 1:19){ PVE[i] <- sum(pcr.fit$Xvar[1:i]) / pcr.fit$Xtotvar }
barplot( PVE, names.arg = 1:19, main = "scree plot",
xlab = "number of PCs",
ylab = "proportion of variance explained" )
5.3 Comparing Predictive Performances
Example for practical 3. It uses the leaps, pls and glmnet packages, and we are told to split the data into training and test sets in a 28:10 ratio. We also need to include predict.regsubsets from chapter 4.
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  xvars = names(coefi)
  mat[, xvars] %*% coefi
}
We want to calculate the correlation and MSE for the four models BSS (\(C_p\)), Ridge (5-fold CV), Lasso (5-fold CV) and PCR over 50 repetitions.
repetitions = 50
cor.bss = c()
cor.ridge = c()
cor.lasso = c()
cor.pcr = c()
mse.bss = c()
mse.ridge = c()
mse.lasso = c()
mse.pcr = c()
for(i in 1:repetitions){
  # Step (i) data splitting
  training.obs = sample(1:38, 28)
  y.train = seatpos$hipcenter[training.obs]
  x.train = model.matrix(hipcenter ~ ., seatpos[training.obs, ])[,-1]
  y.test = seatpos$hipcenter[-training.obs]
  x.test = model.matrix(hipcenter ~ ., seatpos[-training.obs, ])[,-1]

  # Step (ii) training phase
  bss.train = regsubsets(hipcenter ~ ., seatpos[training.obs, ], nvmax = 8)
  min.cp = which.min(summary(bss.train)$cp)

  ridge.train = cv.glmnet(x.train, y.train, alpha = 0, nfolds = 5)
  lasso.train = cv.glmnet(x.train, y.train, nfolds = 5)

  pcr.train = pcr(hipcenter ~ ., data = seatpos[training.obs, ],
                  scale = TRUE, validation = "CV")
  min.pcr = which.min(MSEP(pcr.train)$val[1,1, ]) - 1

  # Step (iii) generating predictions
  predict.bss = predict.regsubsets(bss.train, seatpos[-training.obs, ], min.cp)
  predict.ridge = predict(ridge.train, x.test, s = 'lambda.min')
  predict.lasso = predict(lasso.train, x.test, s = 'lambda.min')
  predict.pcr = predict(pcr.train, seatpos[-training.obs, ], ncomp = min.pcr)

  # Step (iv) evaluating predictive performance
  mse.bss[i] = mean((y.test - predict.bss)^2)
  mse.ridge[i] = mean((y.test - predict.ridge)^2)
  mse.lasso[i] = mean((y.test - predict.lasso)^2)
  mse.pcr[i] = mean((y.test - predict.pcr)^2)

  cor.bss[i] = cor(y.test, predict.bss)
  cor.ridge[i] = cor(y.test, predict.ridge)
  cor.lasso[i] = cor(y.test, predict.lasso)
  cor.pcr[i] = cor(y.test, predict.pcr)
}
We can then compare with boxplots
boxplot(mse.bss, mse.ridge, mse.lasso, mse.pcr, names = c('BSS','Ridge','Lasso','PCR'),
ylab = 'Test MSE', col = 2:5)
boxplot(cor.bss, cor.ridge, cor.lasso, cor.pcr, names = c('BSS','Ridge','Lasso','PCR'),
ylab = 'Test Correlation', col = 2:5)
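Alongside the boxplots, a small (purely illustrative) summary table of average performance over the repetitions can also be printed:

data.frame(method   = c('BSS', 'Ridge', 'Lasso', 'PCR'),
           mean.MSE = c(mean(mse.bss), mean(mse.ridge), mean(mse.lasso), mean(mse.pcr)),
           mean.cor = c(mean(cor.bss), mean(cor.ridge), mean(cor.lasso), mean(cor.pcr)))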