Estimating the weighted covariance matrix in R

Statistics
Author

Edoardo Costantini

Published

March 14, 2022

Introduction

In a sample made of groups of different sizes, descriptive statistics like the mean and the covariance between variables can be computed by assigning proper weights to account for the difference in group sizes. Wights are generally normalized (i.e., \(\sum_{i = 1}^{n} w_i = 1\)).

Learn by coding

Example

Now, let’s consider a very simple example. Say that you have a dataset with two variables and that you have a vector of weights defining how important each observation should be.

# Initial simple example -------------------------------------------------------

  # Get the dataset used in the example of stats::cov.wt()
  xy <- cbind(x = 1:10, y = c(1:3, 8:5, 8:10))

  # Define non-negative weights (as in example of stats::cov.wt())
  wi <- c(0,0,0,1,1,1,1,1,0,0)

  # Get the weighted estimate with the default methods
  covwt_stats <- stats::cov.wt(xy, wt = wi) # i.e. method = "unbiased"

  # Compare unweighted and weighted means
  data.frame(uw = colMeans(xy),
             select = colMeans(xy[wi == 1, ]),
             wg = covwt_stats$center)
   uw select  wg
x 5.5    6.0 6.0
y 5.9    6.8 6.8
  # Compare unweighted and weighted covariance matrix
  data.frame(uw = c(cov(xy)),
             select = c(cov(xy[wi == 1, ])),
             wg = c(covwt_stats$cov),
             row.names = c(sapply(colnames(cov(xy)), paste0, rownames(cov(xy))))
  )
         uw select   wg
xx 9.166667    2.5  2.5
xy 8.055556   -0.5 -0.5
yx 8.055556   -0.5 -0.5
yy 9.433333    1.7  1.7

Note how by weighting with a vector of 0 and 1s we are basically saying that the observations with a 0 will be excluded from the count. They are weighted to have 0 impact on the computation of the descriptive statistics. This is clear when you compare the results of the select and wg columns.

Computing the weighted covariance matrix manually

We could replicate the results of the weighting simply by selecting a subset of the original data because all observations were either weighted 0 or equally (1). When this is not the case, weighting is slightly more complicated.

Exploring the stats::cov.wt() function code

Let’s look at how the cov.wt() function works more in depth. The internal code of the function is the following:

# Examine the internal code of stats::cov.wt() ---------------------------------

  cov.wt
function (x, wt = rep(1/nrow(x), nrow(x)), cor = FALSE, center = TRUE, 
    method = c("unbiased", "ML")) 
{
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    else if (!is.matrix(x)) 
        stop("'x' must be a matrix or a data frame")
    if (!all(is.finite(x))) 
        stop("'x' must contain finite values only")
    n <- nrow(x)
    if (with.wt <- !missing(wt)) {
        if (length(wt) != n) 
            stop("length of 'wt' must equal the number of rows in 'x'")
        if (any(wt < 0) || (s <- sum(wt)) == 0) 
            stop("weights must be non-negative and not all zero")
        wt <- wt/s
    }
    if (is.logical(center)) {
        center <- if (center) 
            colSums(wt * x)
        else 0
    }
    else {
        if (length(center) != ncol(x)) 
            stop("length of 'center' must equal the number of columns in 'x'")
    }
    x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)
    cov <- switch(match.arg(method), unbiased = crossprod(x)/(1 - 
        sum(wt^2)), ML = crossprod(x))
    y <- list(cov = cov, center = center, n.obs = n)
    if (with.wt) 
        y$wt <- wt
    if (cor) {
        Is <- 1/sqrt(diag(cov))
        R <- cov
        R[] <- Is * cov * rep(Is, each = nrow(cov))
        y$cor <- R
    }
    y
}
<bytecode: 0x7fa3526d4b58>
<environment: namespace:stats>

Note the following:

  • The first thing to pay attention to is that the function can compute the weighted covariance matrix in two ways:

    • unbiased, using corssprod(x) / (1 - sum(wt^2))
    • ML (or maximum likelihood), using corssprod(x)
  • Note that the wt object is divided by the sum of the values it is storing, which amounts to normalising the weights. This happens with wt <- wt/s with s being created inside an if statement as s <- sum(wt).

  • x is centered on the normalized weigthed means using the sweep function

  • x is weighted by multiplying by sqrt(wt)

Reproducing the internal steps

First, we’ll set up a few objects we need to replicate manually what happens inside the stats::cov.wt() function. We need to define a dataset, a vector of weights, a method to compute descriptives, and based on these we will also create an object to store the number of rows (n). As a vector of weights we sample random values between 0 and 1. We can think of this as an attempt to weight each observation for the probability of sampling them from a population.

# Set up manual computation of cov.wt() ----------------------------------------

  # Assign values to the function arguments
  x      <- xy                     # data
  set.seed(20220314)
  wi     <- runif(length(wi), min = 0, max = 1)
  method <- "ML"                   # use Maximum Likelihood for estimation

  # Assign values to some of the internal objects
  n <- nrow(x)

Next, we want to make sure we normalize the weights. In other words we want to make sure the weights sum to 1.

# Normalize weights ------------------------------------------------------------

  # Normalise weights (to sum to 1)
  wn <- wi / sum(wi)

  # Check they sum to 1
  sum(wn) == 1
[1] TRUE

Then, we want to compute the vector of weighted means.

# Compute the weighted means ---------------------------------------------------

  # Center on weighted mean if required
  center <- colSums(wn * x)

  # Center X on the weigthed mean
  x_cent <- sweep(x, 2, center, check.margin = FALSE)

  # Note that the sweep is subtracting the "center" to each value
  all.equal(
    sweep(x, 2, center, check.margin = FALSE),
    t(apply(x, 1, function (i) i - center))
  )
[1] TRUE

Note that the weighted mean is computed as: [ {x} = _{i = 1}^{n} w_i x_i ] and that center <- colSums(wn * x) is doing exactly this.

Finally, we want to compute the weighted covariance matrix.

# Compute the weighted covariance matrix ---------------------------------------

  # Weight (centered) data
  x_weighted <- sqrt(wn) * x_cent

  # Compute the ML weigthed covariance matrix manually
  covwt_man <- crossprod(x_weighted)

  # Print the manual weigthed covariance matrix
  covwt_man
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # Compute the ML weigthed covariance matrix with stats::cov.wt()
  covwt_stats <- cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

  # Compare manual and stats weigthed covariance matrices
  covwt_man - covwt_stats
  x y
x 0 0
y 0 0

Mathematical formula and alternative R computations

Unbiased weighted covariance matrix

For a given population covariance matrix \(Q\), each element \(q_{ik}\) of the unbiased estimation of the weighted covariance matrix \(\hat{Q}\) can be computed with the following formula:

[ q_{ik} = {i = 1}^{n} w_i (x{ij} - {x}j) (x{ij} - {x}_k) ]

with \(\bar{x}_j\) being the weighted mean for variable \(j\), and \(w_i\) being the normalized weight for a given observation (which we store in the vector wn). The following are alternative ways of computing it with mathematical or R shortcuts:

# Alternative computations of the unbiased weighted covariance mat -------------

  # Literal translation of equation
  1 / (1 - sum(wn^2)) * t(wn * x_cent) %*% (x_cent)
         x        y
x 8.320767 6.363485
y 6.363485 7.275922
  # Rearrange denominator
  t(wn * x_cent) %*% (x_cent) / (1 - sum(wn^2))
         x        y
x 8.320767 6.363485
y 6.363485 7.275922
  # Spread wn
  t(sqrt(wn) * x_cent) %*% (sqrt(wn)*x_cent) / (1 - sum(wn^2))
         x        y
x 8.320767 6.363485
y 6.363485 7.275922
  # Replace manual cross-product with R cross-product
  crossprod(sqrt(wn) * x_cent)/(1 - sum(wn^2))
         x        y
x 8.320767 6.363485
y 6.363485 7.275922
  # Compute with stats::cov.wt()
  cov.wt(xy, wt = wi, method = "unbiased", center = TRUE)$cov
         x        y
x 8.320767 6.363485
y 6.363485 7.275922

Maximum Likelihood weighted covariance matrix

Each element \(q_{ik}\) of the maximum likelihood weighted covariance matrix estimate \(\hat{Q}\) can be computed manually with the following formula:

[ q_{ik} = {i = 1}^{n} w_i (x{ij} - {x}j) (x{ij} - {x}_k) ]

with \(\bar{x}_j\) being the weighted mean for variable \(j\), and \(w_i\) being the normalized weight for a given observation. The following are alternative ways of computing it with mathematical or R shortcuts:

# Alternative computations of the ML weighted covariance mat -------------------

  # R manual cross-product using un-normalised weights
  1 / sum(wi) * t(wi * x_cent) %*% (x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # Using the normalised weights
  1 / sum(wn) * t(wn * x_cent) %*% (x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # Dropp the term = 1
  t(wn * x_cent) %*% (x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # Spread wn
  t(sqrt(wn) * x_cent) %*% (sqrt(wn) * x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # Replace manual cross-product with R cross-product
  crossprod(sqrt(wn) * x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # R cross-product matrix
  crossprod(x_weighted)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
  # Compute with stats::cov.wt()
  cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov
         x        y
x 7.102015 5.431419
y 5.431419 6.210209

Relationship with the matrix of sufficient statistics

Note the relationship between the covariance matrix and the matrix of sufficient statistics is the same as in the unweighted case: \(\text{cov} = \frac{T_{\text{obs}}}{n}\).

# Obtain the matrix of sufficient statistics Tobs ------------------------------

  # Define a new weigthing object
  set.seed(20220314)
  wi <- runif(length(wi), min = 0, max = 1)
  wn <- wi / sum(wi)

  # Compute the weighted means of X again
  center <- colSums(wn * x)
  x_cent <- sweep(x, 2, center, check.margin = FALSE)

  # "Effective" sample size
  n <- sum(wi)

  # Number of columns
  p <- ncol(x)

  # Obtain matrix of sufficient statistics (Tobs)
  Tobs_lopp <- matrix(0, p, p)
  for(i in 1:nrow(x)){
    Tobs_lopp <- Tobs_lopp + wi[i] * (x_cent[i, ]) %*% t(x_cent[i, ])
  }

  # Obtain matrix of sufficient statistics (Tobs) w/ cross-product shortcut
  Tobs_cp <- t(wi * x_cent) %*% x_cent

  # Compare loop version and cross-product shortcut
  Tobs_lopp - Tobs_cp
     x            y
[1,] 0 3.552714e-15
[2,] 0 0.000000e+00
  # Assign simpler name and print Tobs
  (Tobs <- Tobs_cp)
         x        y
x 31.08417 23.77229
y 23.77229 27.18091
  # Convert to a covariance matrix
  covmat <- Tobs / n

  # Check it's what you were expecting
  covmat - cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov
  x             y
x 0 -8.881784e-16
y 0 -8.881784e-16

Note the following:

  • we are using the normalized weights wn to center the data, but we are using the un-normalised weights to scale the data contribution to Tobs
  • if we had used the normalized weights, \(n\) would have been equal to 1 and covmat would be equal to Tobs.
# Obtain the matrix of sufficient statistics Tobs (normalised weights) ---------

  # Convert to a covariance matrix
  covmat - t(wn * x_cent) %*% x_cent
             x y
x 0.000000e+00 0
y 8.881784e-16 0
  # Then, covmat relates to Tobs as
  (t(wn * x_cent) %*% x_cent * n) - Tobs_cp
              x y
x  0.000000e+00 0
y -3.552714e-15 0
  # So we could say
  Tobs <- t(wn * x_cent) %*% x_cent * n

TL;DR, just give me the code!

# Initial simple example -------------------------------------------------------

  # Get the dataset used in the example of stats::cov.wt()
  xy <- cbind(x = 1:10, y = c(1:3, 8:5, 8:10))

  # Define non-negative weights (as in example of stats::cov.wt())
  wi <- c(0,0,0,1,1,1,1,1,0,0)

  # Get the weighted estimate with the default methods
  covwt_stats <- stats::cov.wt(xy, wt = wi) # i.e. method = "unbiased"

  # Compare unweighted and weighted means
  data.frame(uw = colMeans(xy),
             select = colMeans(xy[wi == 1, ]),
             wg = covwt_stats$center)

  # Compare unweighted and weighted covariance matrix
  data.frame(uw = c(cov(xy)),
             select = c(cov(xy[wi == 1, ])),
             wg = c(covwt_stats$cov),
             row.names = c(sapply(colnames(cov(xy)), paste0, rownames(cov(xy))))
  )

# Examine the internal code of stats::cov.wt() ---------------------------------

  cov.wt

# Set up manual computation of cov.wt() ----------------------------------------

  # Assign values to the function arguments
  x      <- xy                     # data
  set.seed(20220314)
  wi     <- runif(length(wi), min = 0, max = 1)
  method <- "ML"                   # use Maximum Likelihood for estimation

  # Assign values to some of the internal objects
  n <- nrow(x)

# Normalize weights ------------------------------------------------------------

  # Normalise weights (to sum to 1)
  wn <- wi / sum(wi)

  # Check they sum to 1
  sum(wn) == 1


# Compute the weighted means ---------------------------------------------------

  # Center on weighted mean if required
  center <- colSums(wn * x)

  # Center X on the weigthed mean
  x_cent <- sweep(x, 2, center, check.margin = FALSE)

  # Note that the sweep is subtracting the "center" to each value
  all.equal(
    sweep(x, 2, center, check.margin = FALSE),
    t(apply(x, 1, function (i) i - center))
  )


# Compute the weighted covariance matrix ---------------------------------------

  # Weight (centered) data
  x_weighted <- sqrt(wn) * x_cent

  # Compute the ML weigthed covariance matrix manually
  covwt_man <- crossprod(x_weighted)

  # Print the manual weigthed covariance matrix
  covwt_man

  # Compute the ML weigthed covariance matrix with stats::cov.wt()
  covwt_stats <- cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

  # Compare manual and stats weigthed covariance matrices
  covwt_man - covwt_stats

# Alternative computations of the unbiased weighted covariance mat -------------

  # Literal translation of equation
  1 / (1 - sum(wn^2)) * t(wn * x_cent) %*% (x_cent)

  # Rearrange denominator
  t(wn * x_cent) %*% (x_cent) / (1 - sum(wn^2))

  # Spread wn
  t(sqrt(wn) * x_cent) %*% (sqrt(wn)*x_cent) / (1 - sum(wn^2))

  # Replace manual cross-product with R cross-product
  crossprod(sqrt(wn) * x_cent)/(1 - sum(wn^2))

  # Compute with stats::cov.wt()
  cov.wt(xy, wt = wi, method = "unbiased", center = TRUE)$cov

# Alternative computations of the ML weighted covariance mat -------------------

  # R manual cross-product using un-normalised weights
  1 / sum(wi) * t(wi * x_cent) %*% (x_cent)

  # Using the normalised weights
  1 / sum(wn) * t(wn * x_cent) %*% (x_cent)

  # Dropp the term = 1
  t(wn * x_cent) %*% (x_cent)

  # Spread wn
  t(sqrt(wn) * x_cent) %*% (sqrt(wn) * x_cent)

  # Replace manual cross-product with R cross-product
  crossprod(sqrt(wn) * x_cent)

  # R cross-product matrix
  crossprod(x_weighted)

  # Compute with stats::cov.wt()
  cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

# Obtain the matrix of sufficient statistics Tobs ------------------------------

  # Define a new weigthing object
  set.seed(20220314)
  wi <- runif(length(wi), min = 0, max = 1)
  wn <- wi / sum(wi)

  # Compute the weighted means of X again
  center <- colSums(wn * x)
  x_cent <- sweep(x, 2, center, check.margin = FALSE)

  # "Effective" sample size
  n <- sum(wi)

  # Number of columns
  p <- ncol(x)

  # Obtain matrix of sufficient statistics (Tobs)
  Tobs_lopp <- matrix(0, p, p)
  for(i in 1:nrow(x)){
    Tobs_lopp <- Tobs_lopp + wi[i] * (x_cent[i, ]) %*% t(x_cent[i, ])
  }

  # Obtain matrix of sufficient statistics (Tobs) w/ cross-product shortcut
  Tobs_cp <- t(wi * x_cent) %*% x_cent

  # Compare loop version and cross-product shortcut
  Tobs_lopp - Tobs_cp

  # Assign simpler name and print Tobs
  (Tobs <- Tobs_cp)

  # Convert to a covariance matrix
  covmat <- Tobs / n

  # Check it's what you were expecting
  covmat - cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

# Obtain the matrix of sufficient statistics Tobs (normalised weights) ---------

  # Convert to a covariance matrix
  covmat - t(wn * x_cent) %*% x_cent

  # Then, covmat relates to Tobs as
  (t(wn * x_cent) %*% x_cent * n) - Tobs_cp

  # So we could say
  Tobs <- t(wn * x_cent) %*% x_cent * n