Introduction
When there are many correlated predictors in a linear regression model, their regression coefficients can become poorly determined and exhibit high variance. This problem can be alleviated by imposing a size constraint (or penalty) on the coefficients. Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficient values minimize a penalized residual sum of squares:
\[ \hat{\beta}^{\text{ridge}} = \text{argmin}_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \right\} \]
The ridge solutions are not equivariant under scaling of the inputs. Therefore, it is recommended to standardize the inputs before solving the minimization problem.
Notice that the intercept \(\beta_0\) has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\). Furthermore, by centering the predictors, we can separate the solution to the minimization problem into two parts:
Intercept \[ \hat{\beta}_0 = \bar{y}=\frac{1}{N}\sum_{i = 1}^{N} y_i \]
Penalised regression coefficients \[ \hat{\beta}^{\text{ridge}}=(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^Ty \] which is the regular formula for estimating regression coefficients, with a penalty term (\(\lambda\)) added to the diagonal of the cross-product matrix (\(\mathbf{X}^T\mathbf{X}\)) via the identity matrix (\(\mathbf{I}\)); this addition also makes the matrix to be inverted (\((...)^{-1}\)) invertible.
Learn by coding
The glmnet package can be used to obtain the ridge regression estimates of the regression coefficients. In this section, we will first see how to obtain these estimates “manually”, that is, coding every step on our own, and then we will see how to obtain the same results using the glmnet package.
Let’s start by setting up the R environment. In this post, we will work with the mtcars data. If you are not familiar with it, just look up the R help file on it. We will use the first column of the dataset (variable named mpg) as a dependent variable and the remaining ones as predictors in a linear regression.
# Set up -----------------------------------------------------------------------
# Load packages
library(glmnet)
# Take the mtcars data
y <- mtcars[, "mpg"]
X <- mtcars[, -1]
# Create a few shorthands we will use
n <- nrow(X)
p <- ncol(X)
Fitting ridge regression manually
First, let’s make sure the predictors are centered on the mean and scaled to have a variance of 1.
# Fitting ridge regression manually --------------------------------------------
# Scale the data (standardize)
X_scale <- scale(X, center = TRUE, scale = TRUE)
Then, we want to fit the ridge regression manually by separating the estimation of the intercept from the estimation of the regression coefficients (a two-step approach):
Estimate the intercept (\(\hat{\beta}_0\))
# Estimate the intercept
b0_hat_r <- mean(y)
Estimate the ridge regression coefficients (\(\hat{\beta}^{\text{ridge}}\)).
Compute the cross-product matrix of the predictors.
This is the same step we would take if we wanted to compute the OLS estimates.
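Indeed, setting \(\lambda = 0\) in the ridge formula gives back the familiar OLS estimator, so the cross-product matrix plays the same role in both: \[ \hat{\beta}^{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y. \]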
# Compute the cross-product matrix of the data
XtX <- t(X_scale) %*% X_scale
Define a value of \(\lambda\).
This value is usually chosen by cross-validation from a grid of possible values. However, here we are only interested in how \(\lambda\) is used in the computation, so we can simply give it a fixed value.
# Define a lambda value
lambda <- .1
Compute \(\hat{\beta}^{\text{ridge}}\).
# Estimate the regression coefficients with the ridge penalty
bs_hat_r <- solve(XtX + lambda * diag(p)) %*% t(X_scale) %*% y
where diag(p) is the identity matrix \(\mathbf{I}\).
Finally, let’s print the results:
# Print the results
round(
data.frame(twostep = c(b0 = b0_hat_r,
b = bs_hat_r)),
3
)
        twostep
b0 20.091
b1 -0.194
b2 1.366
b3 -1.373
b4 0.438
b5 -3.389
b6 1.361
b7 0.162
b8 1.243
b9 0.496
b10 -0.460
It is important to note the effect of centering and scaling. When fitting ridge regression, many sources recommend centering the data. This allows separating the estimation of the intercept from the estimation of the regression coefficients. As a result, only the regression coefficients are penalised. To understand the effect of centering, consider what happens in regular OLS estimation when predictors are centered:
# Centering in regular OLS -----------------------------------------------------
# Create a version of X that is centered
X_center <- scale(X, center = TRUE, scale = FALSE)
# Fit a regular linear model
lm_ols <- lm(y ~ X_center)
# Check that b0 is equal to the mean of y
coef(lm_ols)["(Intercept)"] - mean(y)
  (Intercept)
-3.552714e-15
Furthermore, let’s see what would have happened if we had penalised the intercept as well.
# Consequence of penalising the intercept --------------------------------------
# Add a vector of 1s to penalise the intercept
X_scale_w1 <- cbind(1, X_scale)
# Compute the cross-product matrix of the data
XtX <- t(X_scale_w1) %*% X_scale_w1
# Estimate the regression coefficients with the ridge penalty
bs_hat_r_w1 <- solve(XtX + lambda * diag(p+1)) %*% t(X_scale_w1) %*% y
# Print the results
round(
data.frame(twostep = c(b0 = b0_hat_r,
b = bs_hat_r),
onestep = c(b0 = bs_hat_r_w1[1],
b = bs_hat_r_w1[-1])),
3
)
        twostep onestep
b0 20.091 20.028
b1 -0.194 -0.194
b2 1.366 1.366
b3 -1.373 -1.373
b4 0.438 0.438
b5 -3.389 -3.389
b6 1.361 1.361
b7 0.162 0.162
b8 1.243 1.243
b9 0.496 0.496
b10 -0.460 -0.460
As you can see, the intercept is shrunk toward zero, without any benefit. As a result, every prediction is offset by the same amount.
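If you want to verify this, here is a quick check that reuses only objects created above (yhat_twostep and yhat_onestep are just illustrative names): because the predictors are centered, both fits have identical slopes, so their predictions differ by a single constant.
# Quick check: predictions differ only by a constant shift
yhat_twostep <- b0_hat_r + X_scale %*% bs_hat_r
yhat_onestep <- X_scale_w1 %*% bs_hat_r_w1
range(yhat_twostep - yhat_onestep) # a constant equal to b0_hat_r - bs_hat_r_w1[1]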
An alternative way to avoid penalising the intercept
It can be handy to obtain estimates of the regression coefficients and intercept in one step. We can use matrix algebra and R to simplify the two-step procedure to a single step. In particular, we can avoid the penalisation of the intercept by setting to 0 the first element of the “penalty” matrix lambda * diag(p + 1).
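In matrix form (a restatement of the code below, where \(\mathbf{X}_1\) denotes the standardized predictors with a leading column of 1s), the one-step estimator is \[ \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}^{\text{ridge}} \end{bmatrix} = \left(\mathbf{X}_1^T\mathbf{X}_1 + \lambda \begin{bmatrix} 0 & \mathbf{0}^T \\ \mathbf{0} & \mathbf{I}_p \end{bmatrix}\right)^{-1}\mathbf{X}_1^T y, \] so the intercept is estimated together with the slopes but left unpenalised.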
# Alternative to avoid penalization of the intercept ---------------------------
# Compute cross-product matrix
XtX <- crossprod(X_scale_w1)
# Create penalty matrix
pen <- lambda * diag(p + 1)
# replace first element with 0
pen[1, 1] <- 0
# Obtain standardized estimates
bs_hat_r2 <- solve(XtX + pen) %*% t(X_scale_w1) %*% (y)
# Compare
round(
data.frame(
twostep = c(b0 = b0_hat_r, b = bs_hat_r),
onestep = c(b0 = bs_hat_r2[1], b = bs_hat_r2[-1])
),
3
)
        twostep onestep
b0 20.091 20.091
b1 -0.194 -0.194
b2 1.366 1.366
b3 -1.373 -1.373
b4 0.438 0.438
b5 -3.389 -3.389
b6 1.361 1.361
b7 0.162 0.162
b8 1.243 1.243
b9 0.496 0.496
b10 -0.460 -0.460
Fit ridge regression with glmnet
The most popular R package to fit regularised regression is glmnet. Let’s see how we can replicate with glmnet the results we obtained with the manual approach. There are three important differences to consider:
1. glmnet uses the biased sample variance estimate when scaling the predictors;
2. glmnet returns the unstandardized regression coefficients;
3. glmnet uses a different parametrization for \(\lambda\).
To obtain the same results with the manual approach and glmnet, we need to account for these differences.
Use the biased estimation of variance
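Here, “biased” refers to the variance estimate that divides by \(n\) instead of \(n - 1\): \[ s_j^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2 = \frac{n-1}{n}\,\widehat{\sigma}_j^2, \] where \(\widehat{\sigma}_j^2\) is the unbiased estimate returned by R’s var(). This is why the code below multiplies the output of var() by \((n-1)/n\) before taking the square root.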
First, let’s use the biased sample variance estimate in computing \(\hat{\beta}^{\text{ridge}}\) with the manual approach:
# Fitting ridge manually with biased variance estimation -----------------------
# Standardize X
X_scale <- sapply(1:p, function (j){
muj <- mean(X[, j]) # mean
sj <- sqrt( var(X[, j]) * (n-1) / n) # (biased) sd
(X[, j] - muj) / sj # center and scale
})
# Create the design matrix
X_scale_dm <- cbind(1, X_scale)
# Compute cross-product matrix
XtX <- crossprod(X_scale_dm)
# Create penalty matrix
pen <- lambda * diag(p + 1)
pen[1, 1] <- 0
# Obtain standardized estimates
bs_hat_r3 <- solve(XtX + pen) %*% t(X_scale_dm) %*% (y)
# Print results
round(
data.frame(
manual = c(b0 = bs_hat_r3[1], b = bs_hat_r3[-1])
),
3
)
    manual
b0 20.091
b1 -0.191
b2 1.353
b3 -1.354
b4 0.430
b5 -3.343
b6 1.343
b7 0.159
b8 1.224
b9 0.488
b10 -0.449
Return the unstandardized coefficients
Next, we need to revert these regression coefficients to their original scale: because we estimated them on the standardized predictors, they are expressed on the standardized scale.
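In formulas, writing \(\hat{\beta}_j^{(s)}\) for the estimates obtained on the standardized data, and \(\bar{x}_j\) and \(s_j\) for the mean and (biased) standard deviation of the \(j\)-th predictor, the back-transformation is \[ \hat{\beta}_j = \frac{\hat{\beta}_j^{(s)}}{s_j}, \qquad \hat{\beta}_0 = \hat{\beta}_0^{(s)} - \sum_{j=1}^{p} \bar{x}_j \frac{\hat{\beta}_j^{(s)}}{s_j}, \] which is what the following code computes.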
# Return the unstandardized coefficients --------------------------------------
# Extract the original mean and standard deviations of all X variables
mean_x <- colMeans(X)
sd_x <- sqrt(apply(X, 2, var) * (n - 1) / n) # biased version
# Revert to original scale
bs_hat_r4 <- c(bs_hat_r3[1] - crossprod(mean_x, bs_hat_r3[-1] / sd_x),
bs_hat_r3[-1] / sd_x)
# Compare manual standardized and unstandardized results
round(
data.frame(
standardized = c(b0 = bs_hat_r3[1], b = bs_hat_r3[-1]),
unstandardized = c(b0 = bs_hat_r4[1], b = bs_hat_r4[-1])
),
3
)
   standardized unstandardized
b0 20.091 12.908
b1 -0.191 -0.109
b2 1.353 0.011
b3 -1.354 -0.020
b4 0.430 0.818
b5 -3.343 -3.471
b6 1.343 0.764
b7 0.159 0.320
b8 1.224 2.491
b9 0.488 0.672
b10 -0.449 -0.282
Adjust the parametrization of \(\lambda\) for glmnet
Next, we need to understand the relationship between the \(\lambda\) parametrization we used and the one used by glmnet. The following code shows that to use a given value of lambda with glmnet, we need to multiply it by the (biased) standard deviation of the dependent variable (sd_y) and divide it by the sample size (n).
# Adjust the parametrization of lambda -----------------------------------------
# Extract the original mean and standard deviations of y (for lambda parametrization)
mean_y <- mean(y)
sd_y <- sqrt(var(y) * (n - 1) / n)
# Compute the value glmnet wants for your target lambda
lambda_glmnet <- sd_y * lambda / n
Compare manual and glmnet ridge regression output
Finally, we can compare the results:
# Fitting ridge regression with glmnet -----------------------------------------
# Fit glmnet
fit_glmnet_s <- glmnet(x = X,
y = y,
alpha = 0,
lambda = lambda_glmnet, # correction for how penalty is used
thresh = 1e-20)
bs_glmnet <- coef(fit_glmnet_s)
# Compare estimated coefficients
round(
data.frame(
manual = c(b0 = bs_hat_r4[1], b = bs_hat_r4[-1]),
glmnet = c(b0 = bs_glmnet[1], b = bs_glmnet[-1])
),
3
)
       manual glmnet
b0 12.908 12.908
b.cyl -0.109 -0.109
b.disp 0.011 0.011
b.hp -0.020 -0.020
b.drat 0.818 0.818
b.wt -3.471 -3.471
b.qsec 0.764 0.764
b.vs 0.320 0.320
b.am 2.491 2.491
b.gear 0.672 0.672
b.carb -0.282 -0.282
TL;DR, just give me the code!
# Set up -----------------------------------------------------------------------
# Load packages
library(glmnet)
# Take the mtcars data
y <- mtcars[, "mpg"]
X <- mtcars[, -1]
# Create a few shorthands we will use
n <- nrow(X)
p <- ncol(X)
# Fitting ridge regression manually --------------------------------------------
# Scale the data (standardize)
X_scale <- scale(X, center = TRUE, scale = TRUE)
# Estimate the intercept
b0_hat_r <- mean(y)
# Compute the cross-product matrix of the data
XtX <- t(X_scale) %*% X_scale
# Define a lambda value
lambda <- .1
# Estimate the regression coefficients with the ridge penalty
bs_hat_r <- solve(XtX + lambda * diag(p)) %*% t(X_scale) %*% y
# Print the results
round(
data.frame(twostep = c(b0 = b0_hat_r,
b = bs_hat_r)),
3
)
# Centering in regular OLS -----------------------------------------------------
# Create a version of X that is centered
X_center <- scale(X, center = TRUE, scale = FALSE)
# Fit a regular linear model
lm_ols <- lm(y ~ X_center)
# Check that b0 is equal to the mean of y
coef(lm_ols)["(Intercept)"] - mean(y)
# Consequence of penalising the intercept --------------------------------------
# Add a vector of 1s to penalise the intercept
X_scale_w1 <- cbind(1, X_scale)
# Compute the cross-product matrix of the data
XtX <- t(X_scale_w1) %*% X_scale_w1
# Estimate the regression coefficients with the ridge penalty
bs_hat_r_w1 <- solve(XtX + lambda * diag(p+1)) %*% t(X_scale_w1) %*% y
# Print the results
round(
data.frame(twostep = c(b0 = b0_hat_r,
b = bs_hat_r),
onestep = c(b0 = bs_hat_r_w1[1],
b = bs_hat_r_w1[-1])),
3
)
# Alternative to avoid penalization of the intercept ---------------------------
# Compute cross-product matrix
XtX <- crossprod(X_scale_w1)
# Create penalty matrix
pen <- lambda * diag(p + 1)
# replace first element with 0
pen[1, 1] <- 0
# Obtain standardized estimates
bs_hat_r2 <- solve(XtX + pen) %*% t(X_scale_w1) %*% (y)
# Compare
round(
data.frame(
twostep = c(b0 = b0_hat_r, b = bs_hat_r),
onestep = c(b0 = bs_hat_r2[1], b = bs_hat_r2[-1])
),
3
)
# Fitting ridge manually with biased variance estimation -----------------------
# Standardize X
X_scale <- sapply(1:p, function (j){
muj <- mean(X[, j]) # mean
sj <- sqrt( var(X[, j]) * (n-1) / n) # (biased) sd
(X[, j] - muj) / sj # center and scale
})
# Create the design matrix
X_scale_dm <- cbind(1, X_scale)
# Compute cross-product matrix
XtX <- crossprod(X_scale_dm)
# Create penalty matrix
pen <- lambda * diag(p + 1)
pen[1, 1] <- 0
# Obtain standardized estimates
bs_hat_r3 <- solve(XtX + pen) %*% t(X_scale_dm) %*% (y)
# Print results
round(
data.frame(
manual = c(b0 = bs_hat_r3[1], b = bs_hat_r3[-1])
),
3
)
# Return the unstandardized coefficients --------------------------------------
# Extract the original mean and standard deviations of all X variables
mean_x <- colMeans(X)
sd_x <- sqrt(apply(X, 2, var) * (n - 1) / n) # biased version
# Revert to original scale
bs_hat_r4 <- c(bs_hat_r3[1] - crossprod(mean_x, bs_hat_r3[-1] / sd_x),
bs_hat_r3[-1] / sd_x)
# Compare manual standardized and unstandardized results
round(
data.frame(
standardized = c(b0 = bs_hat_r3[1], b = bs_hat_r3[-1]),
unstandardized = c(b0 = bs_hat_r4[1], b = bs_hat_r4[-1])
),
3
)
# Adjust the parametrization of lambda -----------------------------------------
# Extract the original mean and standard deviations of y (for lambda parametrization)
mean_y <- mean(y)
sd_y <- sqrt(var(y) * (n - 1) / n)
# Compute the value glmnet wants for your target lambda
lambda_glmnet <- sd_y * lambda / n
# Fitting ridge regression with glmnet -----------------------------------------
# Fit glmnet
fit_glmnet_s <- glmnet(x = X,
y = y,
alpha = 0,
lambda = lambda_glmnet, # correction for how penalty is used
thresh = 1e-20)
bs_glmnet <- coef(fit_glmnet_s)
# Compare estimated coefficients
round(
data.frame(
manual = c(b0 = bs_hat_r4[1], b = bs_hat_r4[-1]),
glmnet = c(b0 = bs_glmnet[1], b = bs_glmnet[-1])
),
3
)