Chapter 2 solutions
Q1: run the same model again but without standardising. What do you observe?
As before, we run
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
library(glmnet)
fit <- glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1)
fit_ns <- glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1, standardize = FALSE)
cbind(fit$beta[,20],fit_ns$beta[,20])
The coefficients differ. Without standardisation, the penalty acts on the raw scale of each variable, so variables with larger variance are effectively penalised less and the estimates change. It is good practice to keep standardisation switched on in glmnet (i.e., leave the default, to let the package handle it).
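The scale effect can be seen in base R, without glmnet: rescaling a predictor rescales its coefficient, which is why the same λ penalises raw-scale and standardised coefficients differently (an illustrative sketch, not glmnet's internals).

```r
# Illustrative base-R check: standardising a predictor rescales its
# regression coefficient by sd(x), so a common penalty level treats
# raw-scale and standardised coefficients differently.
set.seed(1)
x <- rnorm(50, sd = 10)                 # predictor on a large scale
y <- 2 * x + rnorm(50)
b_raw <- unname(coef(lm(y ~ x))[2])     # slope on the raw scale
x_std <- (x - mean(x)) / sd(x)          # standardised predictor
b_std <- unname(coef(lm(y ~ x_std))[2]) # slope on the standardised scale
isTRUE(all.equal(b_std, b_raw * sd(x))) # the two slopes differ by sd(x): TRUE
```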
Q2: what happens if we do not specify how many λ's we want?
If we remove the lambda options:
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
library(glmnet)
fit <- glmnet(x = X, y = y, family = "gaussian")
length(fit$lambda)
fit$dev.ratio
We get fewer than the default 100 values: glmnet terminates the λ path early when the fraction of deviance explained (fit$dev.ratio) is close to 1, which happened here.
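As a side note on what dev.ratio measures: for a Gaussian model it is the fraction of null deviance explained, i.e. R² = 1 − RSS/TSS. A base-R sketch of the quantity (not glmnet's internal code):

```r
# dev.ratio for Gaussian models is just R^2 = 1 - RSS/TSS,
# the quantity glmnet monitors when stopping the lambda path early.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)     # total (null) sum of squares
dev_ratio <- 1 - rss / tss
isTRUE(all.equal(dev_ratio, summary(fit)$r.squared))  # TRUE
```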
Q3 (optional): the glmnet documentation states that there is an α parameter that we can change. What does this do?
The glmnet function actually runs the elastic net model, which is defined by

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right).$$

It uses a combination of the lasso and ridge penalties (similar to how SGL combines the lasso and group lasso), balanced through the α parameter. By default, α is set to 1, so that the model reduces to the lasso, which is why we have not had to worry about it so far. However, the elastic net has been proposed as an extension of the lasso that overcomes many of its issues, so we compare their performances here.
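The mixing can be made concrete with a small helper that evaluates the penalty term for a given β, λ, and α (a hypothetical function of our own, not part of glmnet):

```r
# Hypothetical helper: the elastic net penalty,
# lambda * (alpha * ||beta||_1 + (1 - alpha)/2 * ||beta||_2^2).
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2))
}
enet_penalty(c(1, -2), lambda = 1, alpha = 1)  # lasso:  |1| + |-2| = 3
enet_penalty(c(1, -2), lambda = 1, alpha = 0)  # ridge: (1 + 4) / 2 = 2.5
```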
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
set.seed(3)
X_new <- matrix(rnorm(10 * 20), 10, 20)
y_new <- X_new%*%beta + rnorm(10)
library(glmnet)
alpha_seq = seq(0,1,length.out=20)
preds = data.frame(alpha = alpha_seq, error = rep(0,length(alpha_seq)))
for (i in 1:length(alpha_seq)){
set.seed(2)
fit.cv <- cv.glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1, alpha = alpha_seq[i])
preds$error[i] = mean((y_new-predict(object = fit.cv, newx = X_new, s = "lambda.1se"))^2)
}
plot(preds$alpha, preds$error, type = "b", ylab="Error")
Clearly the model performs badly for α = 0, which is ridge regression. This is not surprising: ridge regression does not shrink coefficients exactly to zero, so if we take a close look at the coefficients, we notice that it is forced to keep all coefficients active.
set.seed(2)
fit.cv.ridge <- cv.glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1, alpha = 0)
cbind(beta, fit.cv.ridge$glmnet.fit$beta[,20])
preds
The lowest error is achieved at α ≈ 0.947 (the 19th value in our sequence), so we refit the model at this value and compare its coefficients to the ridge fit and the truth:
set.seed(2)
fit.cv.best <- cv.glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1, alpha = 0.94736842)
cbind(beta, fit.cv.ridge$glmnet.fit$beta[,20], fit.cv.best$glmnet.fit$beta[,20])
Q4 (optional): look at the glmnet documentation - which parameters might be interesting to vary?
There are many options to alter in the glmnet function. Generally, it is best to leave them at their defaults, as they have been set to sensible values by the authors, unless you have a reason to change them. Some of interest are:

- alpha: this is discussed in Q3.
- standardize: a TRUE/FALSE indicator as to whether the data are standardised. It is good practice to standardise data, so this should be left on. It is not a good idea to standardise the data yourself and then feed this to glmnet, because standardisation alters how λ is used (it is rescaled in the backend, see Q1).
- intercept: a TRUE/FALSE indicator as to whether an intercept is fit. Again, it is good practice to leave this on, unless you have a strong reason to believe that your regression line goes through the origin (i.e., that your response is centred at 0, which is rare). You also should not centre your response yourself, as various changes occur in the backend when this option is on. These two options show that you do not need to do these pre-processing steps yourself; glmnet will do it for you.
- penalty.factor: this allows you to use adaptive penalty weights, which leads to the adaptive lasso (as in SLOPE, see the optional section, although note that this function can not be used for SLOPE, due to the sorting component of SLOPE).
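As a sketch of how adaptive weights are typically built, one can downweight variables with large initial estimates (the names and numbers below are our own illustration, not from glmnet; the resulting vector would be passed via penalty.factor):

```r
# Hypothetical sketch of adaptive-lasso weights: variables with larger
# initial estimates receive smaller penalties.
beta_init <- c(4.8, 5.1, 0.02, -0.01)  # e.g. from an initial OLS or ridge fit
gamma <- 1                             # weight exponent
w <- 1 / abs(beta_init)^gamma          # adaptive penalty weights
round(w, 2)                            # signal variables get small weights
```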
Q5: instead of using the predict function, manually code a prediction.
A linear model is defined as y = Xβ + ε, so to form a prediction we plug our new data into ŷ = Xβ̂, using the β̂ values that we found in the model. We can do that using:
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
library(glmnet)
fit <- glmnet(x = X, y = y, family = "gaussian", lambda = 0.5, intercept = FALSE)
preds_1 = predict(object = fit, newx = X)
preds_2 = X%*%fit$beta
cbind(preds_1[1:5], preds_2[1:5])
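Note that the manual formula above works because we set intercept = FALSE. When an intercept is fit, the prediction needs the extra term ŷ = â₀ + Xβ̂. A base-R sketch using lm estimates (not glmnet):

```r
# With an intercept, the manual prediction is a0 + X %*% beta_hat;
# prepending a column of 1s folds the intercept into the matrix product.
set.seed(1)
X <- matrix(rnorm(30), 10, 3)
y <- X %*% c(1, 2, 0) + 0.5 + rnorm(10)
fit_lm <- lm(y ~ X)
manual <- as.numeric(cbind(1, X) %*% coef(fit_lm))
isTRUE(all.equal(manual, unname(predict(fit_lm))))  # TRUE
```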
Q6: compare the predictions of the 1se model against the min model.
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
set.seed(3)
X_new <- matrix(rnorm(10 * 20), 10, 20)
y_new <- X_new%*%beta + rnorm(10)
library(glmnet)
fit.cv <- cv.glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1)
mean((y_new-predict(object = fit.cv, newx = X_new, s = "lambda.1se"))^2)
mean((y_new-predict(object = fit.cv, newx = X_new, s = "lambda.min"))^2)
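Both lines above compute the same metric, the test mean-squared error, which can be written as a small reusable helper (the function name is our own, not from glmnet):

```r
# Hypothetical helper: test mean-squared error.
mse <- function(y, y_hat) mean((y - y_hat)^2)
mse(c(1, 2, 3), c(1, 2, 5))  # (0 + 0 + 4) / 3
```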
Q7: compare the grplasso package to glmnet to see if standardisation works properly.
To do this, we need to reduce the group lasso to the lasso. We can use singleton groups (each variable in its own group), so that the two models are equivalent. We set the grplasso model up to use its in-built standardisation:
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
library(glmnet)
library(grplasso)
lasso_model <- glmnet(x = X, y = y, family = "gaussian", nlambda = 20, lambda.min.ratio = 0.1)
glasso_model <- grplasso(x = cbind(1,X), y = y, index = c(NA,1:ncol(X)), lambda = lasso_model$lambda, standardize = TRUE, center = TRUE, model = LinReg())
Comparing the final solution, we see that they are not the same, so the built-in standardisation is not working as intended.
cbind(lasso_model$beta[,20], glasso_model$coefficients[-1,20])
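The singleton-group reduction used above rests on a simple identity: for groups of size one, each group norm ‖β_g‖₂ is just |β_j|, so the group lasso penalty collapses to the ℓ₁ (lasso) penalty. A quick base-R check:

```r
# With singleton groups, the sum of group l2 norms equals the l1 norm.
beta <- c(3, -1, 0, 2)
groups <- seq_along(beta)  # one variable per group
grp_norms <- tapply(beta, groups, function(b) sqrt(sum(b^2)))
isTRUE(all.equal(unname(sum(grp_norms)), sum(abs(beta))))  # TRUE
```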
Q8: set the group indexing so that the signal variables are all in the same group. What do you observe?
Choosing a different grouping, so that the signal variables are in the same group, would lead to the following
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
set.seed(3)
X_new <- matrix(rnorm(10 * 20), 10, 20)
y_new <- X_new%*%beta + rnorm(10)
library(grplasso)
X_gl <- cbind(1, X) # add a column of 1s for the intercept
groups <- c(rep(1,5),rep(2,5),rep(3,5),rep(4,5)) # the five signal variables now share group 1
groups_gl <- c(NA, groups) # we add NA as the intercept does not belong to a group
# generate lambda sequence
lambda_max_group <- lambdamax(X_gl, y, groups_gl, standardize = FALSE, center = FALSE, model = LinReg()) # finding lambda max, we can ignore the warning
lambdas_gl <- exp(seq( # calculating the full lambda sequence
from = log(lambda_max_group),
to = log(lambda_max_group * 0.1), # the equivalent of lambda.min.ratio in glmnet
length.out = 20 # how many lambdas we want
))
# fit model
glasso_model <- grplasso(x = X_gl, y = y, index = groups_gl, lambda = lambdas_gl, standardize = FALSE, center = FALSE, model = LinReg())
cbind(beta,glasso_model$coefficients[-1,20])
mean((y_new-predict(object = glasso_model, newdata = cbind(1,X_new))[,20])^2)
Q9 (optional): can you figure out why we get inflated values?
If we turn standardisation off we get:
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
library(SGL)
groups <- c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)) # grouping assumed from the chapter examples: five groups of four
sgl_model <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 20, min.frac = 0.1, alpha = 0.95)
sgl_model_2 <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 20, min.frac = 0.1, alpha = 0.95, standardize = FALSE)
cbind(beta, sgl_model$beta[,20], sgl_model_2$beta[,20])
The non-standardised model recovers β well, while the default model produces inflated values. The inflation is caused by the in-built standardisation of the SGL package, which does not return the coefficients on the original scale of the data. This highlights the issue of pre-processing when it is done incorrectly.
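The mechanics can be seen in base R: a model fit on standardised columns returns coefficients on the wrong scale unless they are transformed back (a sketch of the back-transformation only, not SGL's internals):

```r
# A slope fit on a standardised predictor equals the raw-scale slope
# multiplied by sd(x); dividing by sd(x) undoes the standardisation.
set.seed(1)
x <- rnorm(100, sd = 4)                     # predictor with sd > 1
y <- 3 * x + rnorm(100)
b_raw <- unname(coef(lm(y ~ x))[2])         # slope on the original scale
b_std <- unname(coef(lm(y ~ scale(x)))[2])  # slope on standardised x: inflated
isTRUE(all.equal(b_std / sd(x), b_raw))     # back-transform recovers it: TRUE
```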
Q10: what happens to the predictive score if we allow λ to decrease even further?
We can test the predictive score by decreasing λ to quite an extreme minimum and allowing for more values along the path:
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
set.seed(3)
X_new <- matrix(rnorm(10 * 20), 10, 20)
y_new <- X_new%*%beta + rnorm(10)
library(SGL)
groups <- c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)) # grouping assumed from the chapter examples: five groups of four
sgl_model <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 200, min.frac = 0.001, alpha = 0.95)
preds = rep(0,200)
for (i in 1:200){
preds[i] = mean((y_new-predictSGL(x = sgl_model, newX = X_new, lam = i))^2)
}
plot(preds,type="l",ylab="Error")
We see that after a certain point, decreasing λ further does not provide any additional benefit, only adding more model complexity. Generally, we prefer the simplest model available that does not sacrifice accuracy (a concept known as Occam's Razor).
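A sketch of picking "the simplest model without sacrificing accuracy" (our own illustrative rule, in the spirit of the 1se rule, not something from the SGL package):

```r
# Given errors along a path ordered from large lambda (simple) to small
# lambda (complex), pick the earliest model within 5% of the minimum.
errs <- c(10, 6, 4, 3.2, 3.1, 3.05, 3.04)
tol <- min(errs) * 1.05            # tolerance band around the best error
simplest <- which(errs <= tol)[1]  # earliest (largest-lambda) index in the band
simplest                           # index 5: later models add complexity, not accuracy
```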
We can try an even smaller value of min.frac:
sgl_model <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 20, min.frac = 0.000001, alpha = 0.95)
mean((y_new-predictSGL(x = sgl_model, newX = X_new, lam = 20))^2)
preds = rep(0,20)
for (i in 1:20){
preds[i] = mean((y_new-predictSGL(x = sgl_model, newX = X_new, lam = i))^2)
}
plot(preds,type="b",ylab="Error")
The prediction error continues to decrease, but we are adding a lot of variance to the model by overfitting. This is ok for a simple example like this, but when there are more predictors it can become problematic, especially if the divide between signal and noise is less apparent.
Q11: vary α in the region [0, 1]. What do you observe?
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
set.seed(3)
X_new <- matrix(rnorm(10 * 20), 10, 20)
y_new <- X_new%*%beta + rnorm(10)
library(SGL)
groups <- c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)) # grouping assumed from the chapter examples: five groups of four
alpha_seq = seq(0,1,length.out=20)
preds = data.frame(alpha = alpha_seq, error = rep(0,length(alpha_seq)))
for (i in 1:length(alpha_seq)){
set.seed(2)
fit <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 20, min.frac = 0.001, alpha = alpha_seq[i])
preds$error[i] = mean((y_new-predictSGL(x = fit, newX = X_new, lam = 20))^2)
}
plot(preds$alpha, preds$error, type = "b", ylab="Error",xlab="Alpha")
We observe the error shrinking as α gets close to 1 (the lasso). This is expected in this scenario, as we did not add a grouping structure to the synthetic data.
Q12: can you use the SGL R package to fit the lasso and group lasso?
To do this, we just set alpha to 0 and 1:
set.seed(2)
X <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(5,5),rep(0,15))
y <- X%*%beta + rnorm(100)
library(SGL)
groups <- c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)) # grouping assumed from the chapter examples: five groups of four
fit_glasso <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 20, min.frac = 0.001, alpha = 0)
fit_lasso <- SGL(list(x=X,y=y), groups, type = "linear", nlam = 20, min.frac = 0.001, alpha = 1)
While this recovers the same models as the glmnet and grplasso packages, it is still recommended to use the package specific to the model you want, as it will be more optimised.
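To see why these two settings recover the lasso and group lasso, note that the SGL penalty mixes the two terms linearly through α (a hypothetical helper of our own; the group-size weights sometimes included in the penalty are omitted for simplicity):

```r
# Hypothetical helper: sparse-group lasso penalty,
# lambda * (alpha * ||beta||_1 + (1 - alpha) * sum_g ||beta_g||_2).
sgl_penalty <- function(beta, groups, lambda, alpha) {
  l1 <- sum(abs(beta))                                        # lasso term
  l2 <- sum(tapply(beta, groups, function(b) sqrt(sum(b^2)))) # group lasso term
  lambda * (alpha * l1 + (1 - alpha) * l2)
}
sgl_penalty(c(3, 4), groups = c(1, 1), lambda = 1, alpha = 1)  # lasso: 3 + 4 = 7
sgl_penalty(c(3, 4), groups = c(1, 1), lambda = 1, alpha = 0)  # group lasso: sqrt(9 + 16) = 5
```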