Introduction

In a sample made of groups of different sizes, descriptive statistics like the mean and the covariance between variables can be computed by assigning proper weights to account for the difference in group sizes. Wights are generally normalized (i.e., \(\sum_{i = 1}^{n} w_i = 1\)).

Learn by coding

Example

Now, let’s consider a very simple example. Say that you have a dataset with two variables and that you have a vector of weights defining how important each observation should be.

# Initial simple example -------------------------------------------------------

  # Get the dataset used in the example of stats::cov.wt()
  xy <- cbind(x = 1:10, y = c(1:3, 8:5, 8:10))

  # Define non-negative weights (as in example of stats::cov.wt())
  wi <- c(0,0,0,1,1,1,1,1,0,0)

  # Get the weighted estimate with the default methods
  covwt_stats <- stats::cov.wt(xy, wt = wi) # i.e. method = "unbiased"

  # Compare unweighted and weighted means
  data.frame(uw = colMeans(xy),
             select = colMeans(xy[wi == 1, ]),
             wg = covwt_stats$center)

   uw select  wg
x 5.5    6.0 6.0
y 5.9    6.8 6.8

  # Compare unweighted and weighted covariance matrix
  data.frame(uw = c(cov(xy)),
             select = c(cov(xy[wi == 1, ])),
             wg = c(covwt_stats$cov),
             row.names = c(sapply(colnames(cov(xy)), paste0, rownames(cov(xy))))
  )

         uw select   wg
xx 9.166667    2.5  2.5
xy 8.055556   -0.5 -0.5
yx 8.055556   -0.5 -0.5
yy 9.433333    1.7  1.7

Note how by weighting with a vector of 0 and 1s we are basically saying that the observations with a 0 will be excluded from the count. They are weighted to have 0 impact on the computation of the descriptive statistics. This is clear when you compare the results of the select and wg columns.

Computing the weighted covariance matrix manually

We could replicate the results of the weighting simply by selecting a subset of the original data because all observations were either weighted 0 or equally (1). When this is not the case, weighting is slightly more complicated.

Exploring the `stats::cov.wt()` function code

Let’s look at how the cov.wt() function works more in depth. The internal code of the function is the following:

# Examine the internal code of stats::cov.wt() ---------------------------------

  cov.wt

function (x, wt = rep(1/nrow(x), nrow(x)), cor = FALSE, center = TRUE, 
    method = c("unbiased", "ML")) 
{
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    else if (!is.matrix(x)) 
        stop("'x' must be a matrix or a data frame")
    if (!all(is.finite(x))) 
        stop("'x' must contain finite values only")
    n <- nrow(x)
    if (with.wt <- !missing(wt)) {
        if (length(wt) != n) 
            stop("length of 'wt' must equal the number of rows in 'x'")
        if (any(wt < 0) || (s <- sum(wt)) == 0) 
            stop("weights must be non-negative and not all zero")
        wt <- wt/s
    }
    if (is.logical(center)) {
        center <- if (center) 
            colSums(wt * x)
        else 0
    }
    else {
        if (length(center) != ncol(x)) 
            stop("length of 'center' must equal the number of columns in 'x'")
    }
    x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)
    cov <- switch(match.arg(method), unbiased = crossprod(x)/(1 - 
        sum(wt^2)), ML = crossprod(x))
    y <- list(cov = cov, center = center, n.obs = n)
    if (with.wt) 
        y$wt <- wt
    if (cor) {
        Is <- 1/sqrt(diag(cov))
        R <- cov
        R[] <- Is * cov * rep(Is, each = nrow(cov))
        y$cor <- R
    }
    y
}
<bytecode: 0x7fa3526d4b58>
<environment: namespace:stats>

Note the following:

The first thing to pay attention to is that the function can compute the weighted covariance matrix in two ways:
- unbiased, using corssprod(x) / (1 - sum(wt^2))
- ML (or maximum likelihood), using corssprod(x)
Note that the wt object is divided by the sum of the values it is storing, which amounts to normalising the weights. This happens with wt <- wt/s with s being created inside an if statement as s <- sum(wt).
x is centered on the normalized weigthed means using the sweep function
x is weighted by multiplying by sqrt(wt)

Reproducing the internal steps

First, we’ll set up a few objects we need to replicate manually what happens inside the stats::cov.wt() function. We need to define a dataset, a vector of weights, a method to compute descriptives, and based on these we will also create an object to store the number of rows (n). As a vector of weights we sample random values between 0 and 1. We can think of this as an attempt to weight each observation for the probability of sampling them from a population.

# Set up manual computation of cov.wt() ----------------------------------------

  # Assign values to the function arguments
  x      <- xy                     # data
  set.seed(20220314)
  wi     <- runif(length(wi), min = 0, max = 1)
  method <- "ML"                   # use Maximum Likelihood for estimation

  # Assign values to some of the internal objects
  n <- nrow(x)

Next, we want to make sure we normalize the weights. In other words we want to make sure the weights sum to 1.

# Normalize weights ------------------------------------------------------------

  # Normalise weights (to sum to 1)
  wn <- wi / sum(wi)

  # Check they sum to 1
  sum(wn) == 1

[1] TRUE

Then, we want to compute the vector of weighted means.

# Compute the weighted means ---------------------------------------------------

  # Center on weighted mean if required
  center <- colSums(wn * x)

  # Center X on the weigthed mean
  x_cent <- sweep(x, 2, center, check.margin = FALSE)

  # Note that the sweep is subtracting the "center" to each value
  all.equal(
    sweep(x, 2, center, check.margin = FALSE),
    t(apply(x, 1, function (i) i - center))
  )

[1] TRUE

Note that the weighted mean is computed as: [ {x} = _{i = 1}^{n} w_i x_i ] and that center <- colSums(wn * x) is doing exactly this.

Finally, we want to compute the weighted covariance matrix.

# Compute the weighted covariance matrix ---------------------------------------

  # Weight (centered) data
  x_weighted <- sqrt(wn) * x_cent

  # Compute the ML weigthed covariance matrix manually
  covwt_man <- crossprod(x_weighted)

  # Print the manual weigthed covariance matrix
  covwt_man

         x        y
x 7.102015 5.431419
y 5.431419 6.210209

  # Compute the ML weigthed covariance matrix with stats::cov.wt()
  covwt_stats <- cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

  # Compare manual and stats weigthed covariance matrices
  covwt_man - covwt_stats

  x y
x 0 0
y 0 0

Mathematical formula and alternative R computations

Unbiased weighted covariance matrix

For a given population covariance matrix \(Q\), each element \(q_{ik}\) of the unbiased estimation of the weighted covariance matrix \(\hat{Q}\) can be computed with the following formula:

[ q_{ik} = {i = 1}^{n} w_i (x{ij} - {x}j) (x{ij} - {x}_k) ]

with \(\bar{x}_j\) being the weighted mean for variable \(j\), and \(w_i\) being the normalized weight for a given observation (which we store in the vector wn). The following are alternative ways of computing it with mathematical or R shortcuts:

# Alternative computations of the unbiased weighted covariance mat -------------

  # Literal translation of equation
  1 / (1 - sum(wn^2)) * t(wn * x_cent) %*% (x_cent)

         x        y
x 8.320767 6.363485
y 6.363485 7.275922

  # Rearrange denominator
  t(wn * x_cent) %*% (x_cent) / (1 - sum(wn^2))

         x        y
x 8.320767 6.363485
y 6.363485 7.275922

  # Spread wn
  t(sqrt(wn) * x_cent) %*% (sqrt(wn)*x_cent) / (1 - sum(wn^2))

         x        y
x 8.320767 6.363485
y 6.363485 7.275922

  # Replace manual cross-product with R cross-product
  crossprod(sqrt(wn) * x_cent)/(1 - sum(wn^2))

         x        y
x 8.320767 6.363485
y 6.363485 7.275922

  # Compute with stats::cov.wt()
  cov.wt(xy, wt = wi, method = "unbiased", center = TRUE)$cov

         x        y
x 8.320767 6.363485
y 6.363485 7.275922

Maximum Likelihood weighted covariance matrix

Each element \(q_{ik}\) of the maximum likelihood weighted covariance matrix estimate \(\hat{Q}\) can be computed manually with the following formula: