Principal covariates regression in R

Supervised learning
Author

Edoardo Costantini

Published

December 13, 2022

Introduction

Principal covariates regression (PCovR) is a method for analyzing the relationship between sets of multivariate data in the presence of highly collinear variables. It modifies the optimization criterion behind PCA to include a supervision aspect: PCovR looks for a low-dimensional representation of \(X\) that accounts for the maximum amount of variation in both \(X\) and \(y\). Compared to regular principal component regression (PCR), PCovR extracts components that account for much of the variability in a set of \(X\) variables and that correlate well with a set of \(Y\) variables. For more information, I recommend reading Vervloet et al. (2015) and De Jong and Kiers (1992). In this post, you can find my R code notes on this method. In these notes, I show the computations used by the PCovR R package to perform the method.

To understand how PCovR differs from classical PCR, we need to expand the notation a bit. Consider the following decomposition of the data:

\[ \begin{align} \mathbf{T} &= \mathbf{X} \mathbf{W} \\ \mathbf{X} &= \mathbf{T} \mathbf{P}_X + \mathbf{E}_X \\ y &= \mathbf{T} \mathbf{P}_y + \mathbf{e}_y \end{align} \]

where \(\mathbf{T}\) is the matrix of PCs defined above, \(\mathbf{W}\) is a \(p \times q\) matrix of component weights defining which linear combinations of the columns of \(\mathbf{X}\) are used to compute the components, and \(\mathbf{P}_X\) and \(\mathbf{P}_y\) are the \(q \times p\) and \(q \times 1\) loading matrices relating the variables in \(\mathbf{X}\) and \(y\) to the component scores in \(\mathbf{T}\), respectively. \(\mathbf{E}_X\) and \(\mathbf{e}_y\) are the reconstruction errors, the information lost by using \(\mathbf{T}\) as a summary of \(\mathbf{X}\) and \(y\).
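
Before moving on, here is a minimal simulated sketch of the objects involved and their dimensions (the sample size, the number of variables, the number of components, and all of the _sim names are arbitrary choices of mine, purely for illustration):

# Small simulated example to check the dimensions in the decomposition
set.seed(1234)
n <- 100; p <- 10; q <- 3
X_sim  <- matrix(rnorm(n * p), n, p)   # n x p predictor matrix
W_sim  <- matrix(rnorm(p * q), p, q)   # p x q component weights
T_sim  <- X_sim %*% W_sim              # n x q component scores
Px_sim <- matrix(rnorm(q * p), q, p)   # q x p loadings for X
Py_sim <- matrix(rnorm(q), q, 1)       # q x 1 loadings for y
dim(T_sim %*% Px_sim)                  # n x p, same shape as X
dim(T_sim %*% Py_sim)                  # n x 1, same shape as y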

Classical PCA can be formulated as the task of finding the \(\mathbf{W}\) and \(\mathbf{P}_X\) that minimize the reconstruction error \(\mathbf{E}_X\):

\[\begin{equation} (\mathbf{W}, \mathbf{P}_X) = \underset{\mathbf{W}, \mathbf{P}_X}{\operatorname{argmin}} \lVert \mathbf{X} - \mathbf{XWP}_X \rVert^2 \end{equation} \]

subject to the constraint \(\mathbf{P}_X \mathbf{P}_X' = \mathbf{I}\). PCovR can be formulated as the task of minimizing a weighted combination of both \(\mathbf{E}_X\) and \(\mathbf{e}_y\):

\[ \begin{equation}\label{eq:PCovR} (\mathbf{W}, \mathbf{P}_X, \mathbf{P}_y) = \underset{\mathbf{W}, \mathbf{P}_X, \mathbf{P}_y}{\operatorname{argmin}} \alpha \lVert \mathbf{X} - \mathbf{XWP}_X \rVert^2 + (1 - \alpha) \lVert y - \mathbf{XWP}_y \rVert^2 \end{equation} \]

subject to the same constraint \(\mathbf{P}_X \mathbf{P}_X' = \mathbf{I}\).
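
To make the criterion concrete, here is a minimal R sketch of the weighted loss being minimized (the function and argument names are mine and purely illustrative, not part of the PCovR package):

# Minimal sketch of the weighted PCovR criterion defined above
pcovr_loss <- function(X, y, W, Px, Py, alpha) {
    T_scores <- X %*% W                              # component scores
    alpha * sum((X - T_scores %*% Px)^2) +           # X reconstruction error
        (1 - alpha) * sum((y - T_scores %*% Py)^2)   # y reconstruction error
}

In the PCovR package, each of the two terms is additionally divided by the total sum of squares of \(\mathbf{X}\) and \(y\), respectively, which is where the denominators in the G matrix used later in this post come from.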

The parameter \(\alpha\) defines which reconstruction error is prioritized. When \(\alpha = 1\), the emphasis is placed exclusively on reconstructing \(\mathbf{X}\), and PCovR reduces to PCR. When \(\alpha = 0.5\), \(\mathbf{X}\) and \(y\) are weighted equally, a case that resembles partial least squares (PLS) regression. In practice, the value of \(\alpha\) can be found by cross-validation or according to a sequential procedure based on maximum likelihood principles (Vervloet et al. 2013). In particular, \[ \begin{equation}\label{eq:aml} \alpha_{ML} = \frac{\lVert \mathbf{X} \rVert^2}{\lVert \mathbf{X} \rVert^2 + \lVert y \rVert^2 \frac{\hat{\sigma}_{\mathbf{E}_X}^2}{\hat{\sigma}_{e_y}^2}} \end{equation} \]

where \(\hat{\sigma}_{\mathbf{E}_X}^2\) can be estimated as the variance in \(\mathbf{X}\) left unexplained by the components of a classical PCA, and \(\hat{\sigma}_{e_y}^2\) can be estimated as the variance in \(y\) left unexplained by the linear model regressing \(y\) on \(\mathbf{X}\).

Learn by coding

To understand PCovR and its relationship with PCR, we will use the R package PCovR, its example dataset alexithymia, the package RegularizedSCA to compute a measure of similarity between matrices (Tucker’s congruence), and the package psych for the congruence between vectors. The alexithymia data reports the scores of 122 Belgian psychology students on three scales:

  • 20-item Toronto Alexithymia Scale (TAS-20) measuring the inability to recognize emotions.
  • Center for Epidemiological Studies Depression Scale (CES-D) measuring depression.
  • Rosenberg Self-Esteem Scale (RSE).

The main objective of the data collection was to assess how well the TAS-20 can predict depression and self-esteem¹. We collect all of the predictors in the X_raw object and the measure of depressive symptomatology (CES-D) in y_raw.

# Set up environment -----------------------------------------------------------

# Load package that implements this method
library("PCovR")

# Load package for a measure of similarity (TuckerCoef)
library("RegularizedSCA")

# Load package for Tucker congruence between vectors
library("psych")

# Load example data from PCovR package
data(alexithymia)

# Subset data
X_raw <- alexithymia$X
y_raw <- alexithymia$Y[, 1, drop = FALSE]

The first thing we’ll do is explore the center and spread of the variables.

# Check variable means and variances
data.frame(
    mean = c(
        X = colMeans(X_raw),
        y = colMeans(y_raw)
        ),
    variance = c(
        X = apply(X_raw, 2, var), 
        y = apply(y_raw, 2, var)
        )
)
                          mean   variance
X.confused           1.8196721   1.388701
X.right words        1.7950820   1.635348
X.sensations         0.5983607   1.151402
X.describe           2.2213115   1.479542
X.analyze problems   2.5081967   1.161089
X.upset              1.6065574   1.678634
X.puzzled            1.1393443   1.294472
X.let happen         1.2622951   1.302534
X.identify           1.6393443   1.620919
X.essential          2.7131148   1.231066
X.feel about people  1.7213115   1.475410
X.describe more      1.0081967   1.363569
X.going on           0.9836066   1.470803
X.why angry          1.2540984   1.496884
X.daily activities   1.6639344   1.514226
X.entertainment      1.6967213   1.601477
X.reveal feelings    1.6475410   2.097886
X.close              2.8442623   1.240008
X.useful             2.4672131   1.193131
X.hidden meanings    1.2131148   1.045116
y.CES-D             16.4672131 118.350156

The variables are measured on quite different scales (note the variance of the CES-D scores), so we will need to scale the data before performing any form of dimensionality reduction.

# Scale data
X <- scale(X_raw)# * (n - 1) / n
y <- scale(y_raw)# * (n - 1) / n

Once the data has been mean-centered and standardized, we can proceed to perform PCovR. First, let’s assume that both \(\alpha\) and the number of PCs we want to compute are given.

# Estimation -------------------------------------------------------------------

# Define fixed parameters
alpha <- .5
npcs <- 5

Then we can use the PCovR::pcovr_est() function to estimate the weights and loadings and to compute the component scores.

# > Estimation with package ----------------------------------------------------

# Estimate with PCovR function
package <- PCovR::pcovr_est(
    X = X,
    Y = y,
    a = alpha,  # fixed alpha
    r = npcs    # fixed number of components
)

Inside the object package, we can find all of the desired matrices. For example, we can extract the component scores:

# Extract PC scores
T_p <- package$Te

Let’s also compute the component scores manually based on the procedure described by Vervloet et al. (2016):

# > Estimation with Vervloet approach (manual) ---------------------------------

# Compute the hat matrix
Hx <- X %*% solve(t(X) %*% X) %*% t(X)

# Compute the G matrix
G <- alpha * X %*% t(X) / sum(X^2) + (1 - alpha) * Hx %*% y %*% t(y) %*% Hx / sum(y^2)

# Take the eigendecomposition of the G matrix
EG <- eigen(G)

# Take the first npcs eigenvectors of G as the PC scores 
T_m <- EG$vectors[, 1:npcs]

We can now compare the results obtained with the two approaches. To keep this section slim and easy to read, we will use the Tucker congruence coefficient to compare matrices. This is a measure of similarity between matrices (or vectors) that ranges between -1 and +1; a congruence greater than 0.95 means the two matrices are virtually identical. First, notice that the PC scores obtained with the two approaches are the same:

# Compare results --------------------------------------------------------------

# T scores
TuckerCoef(T_p, T_m)$tucker_value
[1] 1
# Also interesting
TuckerCoef(T_p, X %*% package$W)$tucker_value
[1] 1
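
As a side note on the metric itself: for two column vectors, Tucker’s congruence is simply their normalized (uncentered) inner product. Here is a minimal manual check on two made-up vectors (u and v are hypothetical, chosen only for illustration):

# Tucker congruence for two vectors: normalized (uncentered) inner product
u <- c(1, 2, 3, 4)
v <- c(1.1, 1.9, 3.2, 3.8)
sum(u * v) / sqrt(sum(u^2) * sum(v^2))

# The same quantity (up to rounding) via psych::factor.congruence
factor.congruence(cbind(u), cbind(v))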

Given the component scores, we can compute all of the other matrices of interest and compare them with the results obtained with the R package PCovR:

# W (weights)
W_p <- package$W
W_m <- solve(t(X) %*% X) %*% t(X) %*% T_m
TuckerCoef(W_p, W_m)$tucker_value
[1] 1
# Px
Px_p <- package$Px
Px_m <- t(W_m) %*% t(X) %*% X
TuckerCoef(t(Px_p), t(Px_m))$tucker_value
[1] 1
# Py
Py_p <- drop(package$Py)
Py_m <- drop(t(W_m) %*% t(X) %*% y) # WtXtY
Py_m <- t(T_m) %*% y #TtY
factor.congruence(x = Py_p, y = Py_m)
     CES-D
[1,]  0.98
# B
B_p <- drop(package$B)
B_m <- drop(W_m %*% Py_m) # WPY
B_m <- drop(W_m %*% t(W_m) %*% t(X) %*% y) # WWtXtY
factor.congruence(B_p, B_m)
     [,1]
[1,]    1

Reverting to PCA

I mentioned before that when \(\alpha = 1\), PCovR reduces to PCR, meaning that the extracted components are those of a classical PCA. Let’s see that in action. First, we set the desired value of \(\alpha\):

# Reverting to PCA -------------------------------------------------------------

# Use alpha 1
alpha <- 1

then, we can perform PCovR

# Estimate PCovR
package <- PCovR::pcovr_est(
    X = X,
    Y = y,
    a = alpha,
    r = npcs # fixed number of components
)

and classical PCA following Guerra-Urzola et al. (2021):

# Classical PCA
uSVt <- svd(X)
U <- uSVt$u
D <- diag(uSVt$d)
V <- uSVt$v
I <- nrow(X)                              # Define the number of rows
P_hat <- (I - 1)^{-1 / 2} * V %*% D       # Component loadings
W_hat <- (I - 1)^{1 / 2} * V %*% solve(D) # Component weights
T_hat <- (I - 1)^{1 / 2} * U              # Component scores
T_hat <- X %*% W_hat                      # Equivalently, scores as X times weights

We can now compare the results again:

# The scores obtained with PCovR are the same as the ones obtained with PCA
TuckerCoef(package$Te, T_hat[, 1:npcs])$tucker_value
[1] 1
# The loadings obtained with PCovR are the same as the ones obtained with PCA
TuckerCoef(t(package$Px), P_hat[, 1:npcs])$tucker_value
[1] 1
# The weights obtained with PCovR are the same as the ones obtained with PCA
TuckerCoef(package$W, W_hat[, 1:npcs])$tucker_value
[1] 1
# P is now proportional to W
TuckerCoef(package$W, t(package$Px))$tucker_value
[1] 1
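
A quick way to see this proportionality directly is to inspect the element-wise ratio of the loadings to the weights: with \(\alpha = 1\), each column of the ratio should be roughly constant. This check is my own addition, assuming none of the weights are numerically zero:

# Element-wise ratio of loadings over weights: with alpha = 1 each column
# should be (numerically) constant, i.e., Px' is column-wise proportional to W
ratio <- t(package$Px) / package$W
round(head(ratio), 3)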

Maximum Likelihood estimation of \(\alpha\)

The value of \(\alpha\) is usually not chosen arbitrarily. One could cross-validate it or compute it with a closed-form solution based on maximum likelihood principles (Vervloet et al. 2016). Here, I show how to use this latter approach. First, let’s fit PCovR with the PCovR::pcovr() function, setting the model selection argument modsel to “seq”. As explained in the help file, this “implies a sequential procedure in which the weighting value is determined on the basis of maximum likelihood principles”, which is exactly what we want.

# Maximum likelihood tuning of alpha -------------------------------------------

# Fit PCovR
package <- pcovr(
    X = X_raw,
    Y = y_raw,
    rot = "none",
    R = npcs, # fixed number of components
    modsel = "seq" # fastest option
)

Then, we can compute the maximum likelihood value of \(\alpha\) by first computing the error terms and taking their ratio.

# Compute error ratio components

# Proportion of variance in y left unexplained by the regression of y on X
lm_mod <- lm(y ~ -1 + X)
ery <- 1 - summary(lm_mod)$r.squared

# Proportion of variance in X left unexplained by the PCA solution
Rmin <- npcs
Rmax <- npcs
sing <- svd(X)
vec <- Rmin:Rmax
vec <- c(vec[1] - 1, vec, vec[length(vec)] + 1) # candidate numbers of components

# Cumulative proportion of variance in X explained by the candidate solutions
VAF <- c(0, cumsum(sing$d^2) / sum(sing$d^2))
VAF <- VAF[vec + 1]

# Scree ratios for the candidate numbers of components
scr <- array(NA, c(1, length(vec)))
for (u in 2:(length(vec) - 1)) {
    scr[, u] <- (VAF[u] - VAF[u - 1]) / (VAF[u + 1] - VAF[u])
}

# Unexplained variance in X for the solution with the highest scree ratio
erx <- 1 - VAF[which.max(scr)]

We could have computed the error ratio directly with the ErrorRatio() function from the PCovR package:

# Compute error ratio with function
err <- ErrorRatio(
    X = X,
    Y = y,
    Rmin = npcs,
    Rmax = npcs
)

# Is it the same?
err - erx/ery
[1] 1.110223e-15

With this value, it is now easy to compute \(\alpha\):

# Find alpha ML
alpha_ML <- sum(X^2) / (sum(X^2) + sum(y^2) * erx / ery)

# Compare to one found by package
package$a - alpha_ML
[1] 0
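
As mentioned earlier, cross-validation is the other common way of choosing \(\alpha\). Below is a minimal manual K-fold sketch built on pcovr_est(); the grid of candidate values, the number of folds, and the use of the already-scaled X and y are arbitrary choices of mine, and this is not the package’s built-in model selection routine:

# Manual K-fold cross-validation of alpha (illustrative sketch only)
set.seed(20221213)
alphas <- seq(0.1, 0.9, by = 0.2)  # candidate weighting values
K <- 10                            # number of folds
folds <- sample(rep(1:K, length.out = nrow(X)))

# Cross-validated prediction error for each candidate alpha
cv_mse <- sapply(alphas, function(a) {
    mean(sapply(1:K, function(k) {
        test <- folds == k
        fit <- PCovR::pcovr_est(
            X = X[!test, , drop = FALSE],
            Y = y[!test, , drop = FALSE],
            a = a,
            r = npcs
        )
        y_hat <- X[test, , drop = FALSE] %*% fit$B
        mean((y[test] - y_hat)^2)  # prediction error on the held-out fold
    }))
})

# Candidate with the lowest cross-validated prediction error
alphas[which.min(cv_mse)]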

TL;DR, just give me the code!

# Set up environment -----------------------------------------------------------

# Load package that implements this method
library("PCovR")

# Load package for a measure of similarity (TuckerCoef)
library("RegularizedSCA")

# Load package for Tucker congruence between vectors
library("psych")

# Load example data from PCovR package
data(alexithymia)

# Subset data
X_raw <- alexithymia$X
y_raw <- alexithymia$Y[, 1, drop = FALSE]

# Check variable means and variances
data.frame(
    mean = c(
        X = colMeans(X_raw),
        y = colMeans(y_raw)
        ),
    variance = c(
        X = apply(X_raw, 2, var), 
        y = apply(y_raw, 2, var)
        )
)

# Scale data
X <- scale(X_raw)# * (n - 1) / n
y <- scale(y_raw)# * (n - 1) / n

# Estimation -------------------------------------------------------------------

# Define fixed parameters
alpha <- .5
npcs <- 5

# > Estimation with package ----------------------------------------------------

# Estimate with PCovR function
package <- PCovR::pcovr_est(
    X = X,
    Y = y,
    a = alpha,  # fixed alpha
    r = npcs    # fixed number of components
)

# Extract PC scores
T_p <- package$Te

# > Estimation with Vervloet approach (manual) ---------------------------------

# Compute the hat matrix
Hx <- X %*% solve(t(X) %*% X) %*% t(X)

# Compute the G matrix
G <- alpha * X %*% t(X) / sum(X^2) + (1 - alpha) * Hx %*% y %*% t(y) %*% Hx / sum(y^2)

# Take the eigendecomposition of the G matrix
EG <- eigen(G)

# Take the first npcs eigenvectors of G as the PC scores 
T_m <- EG$vectors[, 1:npcs]

# Compare results --------------------------------------------------------------

# T scores
TuckerCoef(T_p, T_m)$tucker_value

# Also interesting
TuckerCoef(T_p, X %*% package$W)$tucker_value

# W (weights)
W_p <- package$W
W_m <- solve(t(X) %*% X) %*% t(X) %*% T_m
TuckerCoef(W_p, W_m)$tucker_value

# Px
Px_p <- package$Px
Px_m <- t(W_m) %*% t(X) %*% X
TuckerCoef(t(Px_p), t(Px_m))$tucker_value

# Py
Py_p <- drop(package$Py)
Py_m <- drop(t(W_m) %*% t(X) %*% y) # WtXtY
Py_m <- t(T_m) %*% y #TtY
factor.congruence(x = Py_p, y = Py_m)

# B
B_p <- drop(package$B)
B_m <- drop(W_m %*% Py_m) # WPY
B_m <- drop(W_m %*% t(W_m) %*% t(X) %*% y) # WWtXtY
factor.congruence(B_p, B_m)

# Reverting to PCA -------------------------------------------------------------

# Use alpha 1
alpha <- 1

# Estimate PCovR
package <- PCovR::pcovr_est(
    X = X,
    Y = y,
    a = alpha,
    r = npcs # fixed number of components
)

# Classical PCA
uSVt <- svd(X)
U <- uSVt$u
D <- diag(uSVt$d)
V <- uSVt$v
I <- nrow(X)                              # Define the number of rows
P_hat <- (I - 1)^{-1 / 2} * V %*% D       # Component loadings
W_hat <- (I - 1)^{1 / 2} * V %*% solve(D) # Component weights
T_hat <- (I - 1)^{1 / 2} * U              # Component scores
T_hat <- X %*% W_hat                      # Equivalently, scores as X times weights

# The scores obtained with PCovR are the same as the ones obtained with PCA
TuckerCoef(package$Te, T_hat[, 1:npcs])$tucker_value

# The loadings obtained with PCovR are the same as the ones obtained with PCA
TuckerCoef(t(package$Px), P_hat[, 1:npcs])$tucker_value

# The weights obtained with PCovR are the same as the ones obtained with PCA
TuckerCoef(package$W, W_hat[, 1:npcs])$tucker_value

# P is now proportional to W
TuckerCoef(package$W, t(package$Px))$tucker_value

# Maximum likelihood tuning of alpha -------------------------------------------

# Fit PCovR
package <- pcovr(
    X = X_raw,
    Y = y_raw,
    rot = "none",
    R = npcs, # fixed number of components
    modsel = "seq" # fastest option
)

# Compute error ratio components

# Proportion of variance in y left unexplained by the regression of y on X
lm_mod <- lm(y ~ -1 + X)
ery <- 1 - summary(lm_mod)$r.squared

# Proportion of variance in X left unexplained by the PCA solution
Rmin <- npcs
Rmax <- npcs
sing <- svd(X)
vec <- Rmin:Rmax
vec <- c(vec[1] - 1, vec, vec[length(vec)] + 1) # candidate numbers of components

# Cumulative proportion of variance in X explained by the candidate solutions
VAF <- c(0, cumsum(sing$d^2) / sum(sing$d^2))
VAF <- VAF[vec + 1]

# Scree ratios for the candidate numbers of components
scr <- array(NA, c(1, length(vec)))
for (u in 2:(length(vec) - 1)) {
    scr[, u] <- (VAF[u] - VAF[u - 1]) / (VAF[u + 1] - VAF[u])
}

# Unexplained variance in X for the solution with the highest scree ratio
erx <- 1 - VAF[which.max(scr)]


# Compute error ratio with function
err <- ErrorRatio(
    X = X,
    Y = y,
    Rmin = npcs,
    Rmax = npcs
)

# Is it the same?
err - erx/ery

# Find alpha ML
alpha_ML <- sum(X^2) / (sum(X^2) + sum(y^2) * erx / ery)

# Compare to one found by package
package$a - alpha_ML

References

De Jong, Sijmen, and Henk AL Kiers. 1992. “Principal Covariates Regression: Part I. Theory.” Chemometrics and Intelligent Laboratory Systems 14 (1-3): 155–64.
Guerra-Urzola, Rosember, Katrijn Van Deun, Juan C Vera, and Klaas Sijtsma. 2021. “A Guide for Sparse PCA: Model Comparison and Applications.” Psychometrika 86 (4): 893–919.
Vervloet, Marlies, Henk AL Kiers, Wim Van den Noortgate, and Eva Ceulemans. 2015. “PCovR: An R Package for Principal Covariates Regression.” Journal of Statistical Software 65: 1–14.
Vervloet, Marlies, Katrijn Van Deun, Wim Van den Noortgate, and Eva Ceulemans. 2013. “On the Selection of the Weighting Parameter Value in Principal Covariates Regression.” Chemometrics and Intelligent Laboratory Systems 123: 36–43.
———. 2016. “Model Selection in Principal Covariates Regression.” Chemometrics and Intelligent Laboratory Systems 151: 26–33. https://doi.org/10.1016/j.chemolab.2015.12.004.

Footnotes

  1. Check the helpfile for the alexithymia data in the PCovR package for more information.↩︎