In a sample made of groups of different sizes, descriptive statistics like the mean and the covariance between variables can be computed by assigning each observation a weight that accounts for the difference in group sizes. Weights are generally normalized (i.e., \(\sum_{i = 1}^{n} w_i = 1\)).
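For instance, with normalized weights the weighted mean reduces to \(\sum_{i = 1}^{n} w_i x_i\). A minimal sketch (the values and group sizes below are made up for illustration):

# Hypothetical group means and group sizes used as weights
vals <- c(10, 12, 9)
wts <- c(50, 30, 20)

# Normalise the weights so they sum to 1
wts <- wts / sum(wts)

# Weighted mean as a simple sum of weight * value
sum(wts * vals)

# stats::weighted.mean() normalises internally and gives the same result
weighted.mean(c(10, 12, 9), w = c(50, 30, 20))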
Learn by coding
Example
Now, let’s consider a very simple example. Say that you have a dataset with two variables and that you have a vector of weights defining how important each observation should be.
# Initial simple example -------------------------------------------------------

# Get the dataset used in the example of stats::cov.wt()
xy <- cbind(x = 1:10, y = c(1:3, 8:5, 8:10))

# Define non-negative weights (as in example of stats::cov.wt())
wi <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0)

# Get the weighted estimate with the default methods
covwt_stats <- stats::cov.wt(xy, wt = wi) # i.e. method = "unbiased"

# Compare unweighted and weighted means
data.frame(
  uw = colMeans(xy),
  select = colMeans(xy[wi == 1, ]),
  wg = covwt_stats$center
)

# Compare unweighted and weighted covariance matrix
data.frame(
  uw = c(cov(xy)),
  select = c(cov(xy[wi == 1, ])),
  wg = c(covwt_stats$cov),
  row.names = c(sapply(colnames(cov(xy)), paste0, rownames(cov(xy))))
)
         uw select   wg
xx 9.166667    2.5  2.5
xy 8.055556   -0.5 -0.5
yx 8.055556   -0.5 -0.5
yy 9.433333    1.7  1.7
Note how, by weighting with a vector of 0s and 1s, we are basically saying that the observations with a weight of 0 are excluded from the computation: they have zero impact on the descriptive statistics. This is clear when you compare the results of the select and wg columns.
Computing the weighted covariance matrix manually
We could replicate the results of the weighting simply by selecting a subset of the original data, because all observations were weighted either 0 or equally (1), as the quick check below confirms. When this is not the case, weighting is slightly more complicated.
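A minimal check with the objects defined above: the default "unbiased" estimate obtained with 0/1 weights matches the covariance of the rows that received a weight of 1.

# With 0/1 weights, cov.wt() gives the same result as subsetting the data
all.equal(
  stats::cov.wt(xy, wt = wi)$cov,
  cov(xy[wi == 1, ])
)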
Exploring the stats::cov.wt() function code
Let’s look at how the cov.wt() function works in more depth. The internal code of the function is the following:
# Examine the internal code of stats::cov.wt() ---------------------------------
cov.wt
function (x, wt = rep(1/nrow(x), nrow(x)), cor = FALSE, center = TRUE,
method = c("unbiased", "ML"))
{
if (is.data.frame(x))
x <- as.matrix(x)
else if (!is.matrix(x))
stop("'x' must be a matrix or a data frame")
if (!all(is.finite(x)))
stop("'x' must contain finite values only")
n <- nrow(x)
if (with.wt <- !missing(wt)) {
if (length(wt) != n)
stop("length of 'wt' must equal the number of rows in 'x'")
if (any(wt < 0) || (s <- sum(wt)) == 0)
stop("weights must be non-negative and not all zero")
wt <- wt/s
}
if (is.logical(center)) {
center <- if (center)
colSums(wt * x)
else 0
}
else {
if (length(center) != ncol(x))
stop("length of 'center' must equal the number of columns in 'x'")
}
x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)
cov <- switch(match.arg(method), unbiased = crossprod(x)/(1 -
sum(wt^2)), ML = crossprod(x))
y <- list(cov = cov, center = center, n.obs = n)
if (with.wt)
y$wt <- wt
if (cor) {
Is <- 1/sqrt(diag(cov))
R <- cov
R[] <- Is * cov * rep(Is, each = nrow(cov))
y$cor <- R
}
y
}
<bytecode: 0x7fa3526d4b58>
<environment: namespace:stats>
Note the following:
The first thing to pay attention to is that the function can compute the weighted covariance matrix in two ways:
unbiased, using crossprod(x) / (1 - sum(wt^2))
ML (or maximum likelihood), using crossprod(x)
Note that the wt object is divided by the sum of the values it is storing, which amounts to normalising the weights. This happens with wt <- wt/s with s being created inside an if statement as s <- sum(wt).
x is centered on the normalized weighted means using the sweep function
x is weighted by multiplying it by sqrt(wt), as the quick check below shows
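To see why the sqrt(wt) multiplication works, note that crossprod(sqrt(wt) * A) is simply the weighted cross-product t(A) %*% diag(wt) %*% A. A minimal check (the matrix A and weights w below are arbitrary illustrative values):

# Arbitrary small matrix and normalized weights, just for illustration
A <- cbind(a = c(1, 2, 3), b = c(2, 1, 0))
w <- c(0.2, 0.5, 0.3)

# Scaling the rows by sqrt(w) before the cross-product is equivalent to
# sandwiching a diagonal weight matrix between t(A) and A
all.equal(
  crossprod(sqrt(w) * A),
  t(A) %*% diag(w) %*% A
)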
Reproducing the internal steps
First, we’ll set up a few objects we need to replicate manually what happens inside the stats::cov.wt() function. We need to define a dataset, a vector of weights, and a method to compute the descriptives; based on these, we will also create an object storing the number of rows (n). As a vector of weights we sample random values between 0 and 1. We can think of this as weighting each observation by its probability of being sampled from a population.
# Set up manual computation of cov.wt() ----------------------------------------

# Assign values to the function arguments
x <- xy # dataset
set.seed(20220314)
wi <- runif(length(wi), min = 0, max = 1)
method <- "ML" # use Maximum Likelihood for estimation

# Assign values to some of the internal objects
n <- nrow(x)
Next, we normalize the weights; in other words, we make sure they sum to 1.
# Normalize weights ------------------------------------------------------------

# Normalise weights (to sum to 1)
wn <- wi / sum(wi)

# Check they sum to 1
sum(wn) == 1
# Compute the weighted means ---------------------------------------------------

# Center on weighted mean if required
center <- colSums(wn * x)

# Center X on the weighted mean
x_cent <- sweep(x, 2, center, check.margin = FALSE)

# Note that sweep() is subtracting the "center" from each value
all.equal(
  sweep(x, 2, center, check.margin = FALSE),
  t(apply(x, 1, function(i) i - center))
)
[1] TRUE
Note that the weighted mean is computed as \[ \bar{x} = \sum_{i = 1}^{n} w_i x_i \] and that center <- colSums(wn * x) is doing exactly this.
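As a quick sanity check, the same weighted means can be obtained column by column with stats::weighted.mean(), which normalises the weights internally:

# Column-wise weighted means computed with stats::weighted.mean()
# (it normalises the weights internally, so the un-normalised wi works too)
all.equal(
  colSums(wn * x),
  apply(x, 2, weighted.mean, w = wi)
)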
# Compute the weighted covariance matrix ---------------------------------------

# Weight (centered) data
x_weighted <- sqrt(wn) * x_cent

# Compute the ML weighted covariance matrix manually
covwt_man <- crossprod(x_weighted)

# Print the manual weighted covariance matrix
covwt_man
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
# Compute the ML weighted covariance matrix with stats::cov.wt()
covwt_stats <- cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

# Compare manual and stats weighted covariance matrices
covwt_man - covwt_stats
  x y
x 0 0
y 0 0
Mathematical formula and alternative R computations
Unbiased weighted covariance matrix
For a given population covariance matrix \(Q\), each element \(q_{jk}\) of the unbiased estimate of the weighted covariance matrix \(\hat{Q}\) can be computed with the following formula:
\[ \hat{q}_{jk} = \frac{\sum_{i = 1}^{n} w_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{1 - \sum_{i = 1}^{n} w_i^{2}} \]
with \(\bar{x}_j\) being the weighted mean for variable \(j\), and \(w_i\) being the normalized weight for a given observation (which we store in the vector wn). The following are alternative ways of computing it with mathematical or R shortcuts:
# Alternative computations of the unbiased weighted covariance mat -------------

# Literal translation of equation
1 / (1 - sum(wn^2)) * t(wn * x_cent) %*% (x_cent)

# Rearrange denominator
t(wn * x_cent) %*% (x_cent) / (1 - sum(wn^2))

# Spread wn
t(sqrt(wn) * x_cent) %*% (sqrt(wn) * x_cent) / (1 - sum(wn^2))

# Replace manual cross-product with R cross-product
crossprod(sqrt(wn) * x_cent) / (1 - sum(wn^2))

# Compute with stats::cov.wt()
cov.wt(xy, wt = wi, method = "unbiased", center = TRUE)$cov
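As an element-wise sanity check of the formula above, a single entry, say \(\hat{q}_{xy}\), can be computed directly from the scalar sum using the objects already defined:

# Single entry of the unbiased weighted covariance matrix, computed
# directly from the scalar formula
q_xy <- sum(wn * x_cent[, "x"] * x_cent[, "y"]) / (1 - sum(wn^2))

# Compare with the corresponding entry returned by stats::cov.wt()
all.equal(
  q_xy,
  cov.wt(xy, wt = wi, method = "unbiased", center = TRUE)$cov["x", "y"]
)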
ML weighted covariance matrix
For the maximum likelihood (ML) estimate, each element \(q_{jk}\) of the weighted covariance matrix \(\hat{Q}\) can be computed with the following formula:
\[ \hat{q}_{jk} = \sum_{i = 1}^{n} w_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \]
with \(\bar{x}_j\) being the weighted mean for variable \(j\), and \(w_i\) being the normalized weight for a given observation. The following are alternative ways of computing it with mathematical or R shortcuts:
# Alternative computations of the ML weighted covariance mat -------------------

# R manual cross-product using un-normalised weights
1 / sum(wi) * t(wi * x_cent) %*% (x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
# Using the normalised weights
1 / sum(wn) * t(wn * x_cent) %*% (x_cent)
# Replace manual cross-product with R cross-product
crossprod(sqrt(wn) * x_cent)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
# R cross-product matrix
crossprod(x_weighted)
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
# Compute with stats::cov.wt()
cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov
         x        y
x 7.102015 5.431419
y 5.431419 6.210209
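Note also how the two estimators relate: the unbiased estimate is simply the ML estimate rescaled by 1 / (1 - sum(wn^2)), mirroring the two branches of the switch() inside cov.wt(). A quick check with the objects defined above:

# The unbiased estimate is the ML estimate divided by (1 - sum(wn^2))
all.equal(
  cov.wt(xy, wt = wi, method = "unbiased")$cov,
  cov.wt(xy, wt = wi, method = "ML")$cov / (1 - sum(wn^2))
)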
Relationship with the matrix of sufficient statistics
Note that the relationship between the covariance matrix and the matrix of sufficient statistics is the same as in the unweighted case: \(\text{cov} = \frac{T_{\text{obs}}}{n}\).
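For instance, in the unweighted (equal-weights) case, \(T_{\text{obs}}\) is the cross-product of the mean-centered data and \(n\) is the number of rows; a quick check:

# Unweighted case: Tobs is the cross-product of the mean-centered data
x_cent_uw <- sweep(xy, 2, colMeans(xy))
Tobs_uw <- crossprod(x_cent_uw)

# Dividing by n recovers the (ML) covariance matrix
all.equal(Tobs_uw / nrow(xy), cov.wt(xy, method = "ML")$cov)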
# Obtain the matrix of sufficient statistics Tobs ------------------------------

# Define a new weighting object
set.seed(20220314)
wi <- runif(length(wi), min = 0, max = 1)
wn <- wi / sum(wi)

# Compute the weighted means of X again
center <- colSums(wn * x)
x_cent <- sweep(x, 2, center, check.margin = FALSE)

# "Effective" sample size
n <- sum(wi)

# Number of columns
p <- ncol(x)

# Obtain matrix of sufficient statistics (Tobs) with a loop
Tobs_loop <- matrix(0, p, p)
for (i in 1:nrow(x)) {
  Tobs_loop <- Tobs_loop + wi[i] * (x_cent[i, ]) %*% t(x_cent[i, ])
}

# Obtain matrix of sufficient statistics (Tobs) w/ cross-product shortcut
Tobs_cp <- t(wi * x_cent) %*% x_cent

# Compare loop version and cross-product shortcut
Tobs_loop - Tobs_cp
     x            y
[1,] 0 3.552714e-15
[2,] 0 0.000000e+00
# Assign simpler name and print Tobs
(Tobs <- Tobs_cp)
         x        y
x 31.08417 23.77229
y 23.77229 27.18091
# Convert to a covariance matrix
covmat <- Tobs / n

# Check it's what you were expecting
covmat - cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov
  x             y
x 0 -8.881784e-16
y 0 -8.881784e-16
Note the following:
we are using the normalized weights wn to center the data, but we are using the un-normalised weights wi to scale each observation's contribution to Tobs
if we had used the normalized weights, \(n\) would have been equal to 1 and covmat would be equal to Tobs.
# Obtain the matrix of sufficient statistics Tobs (normalised weights) ---------

# With normalised weights, the cross-product is already the covariance matrix
covmat - t(wn * x_cent) %*% x_cent
             x y
x 0.000000e+00 0
y 8.881784e-16 0
# Then, covmat relates to Tobs as
(t(wn * x_cent) %*% x_cent * n) - Tobs_cp
              x y
x  0.000000e+00 0
y -3.552714e-15 0
# So we could say
Tobs <- t(wn * x_cent) %*% x_cent * n
TL;DR, just give me the code!
# Initial simple example -------------------------------------------------------

# Get the dataset used in the example of stats::cov.wt()
xy <- cbind(x = 1:10, y = c(1:3, 8:5, 8:10))

# Define non-negative weights (as in example of stats::cov.wt())
wi <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0)

# Get the weighted estimate with the default methods
covwt_stats <- stats::cov.wt(xy, wt = wi) # i.e. method = "unbiased"

# Compare unweighted and weighted means
data.frame(
  uw = colMeans(xy),
  select = colMeans(xy[wi == 1, ]),
  wg = covwt_stats$center
)

# Compare unweighted and weighted covariance matrix
data.frame(
  uw = c(cov(xy)),
  select = c(cov(xy[wi == 1, ])),
  wg = c(covwt_stats$cov),
  row.names = c(sapply(colnames(cov(xy)), paste0, rownames(cov(xy))))
)

# Examine the internal code of stats::cov.wt() ---------------------------------

cov.wt

# Set up manual computation of cov.wt() ----------------------------------------

# Assign values to the function arguments
x <- xy # dataset
set.seed(20220314)
wi <- runif(length(wi), min = 0, max = 1)
method <- "ML" # use Maximum Likelihood for estimation

# Assign values to some of the internal objects
n <- nrow(x)

# Normalize weights ------------------------------------------------------------

# Normalise weights (to sum to 1)
wn <- wi / sum(wi)

# Check they sum to 1
sum(wn) == 1

# Compute the weighted means ---------------------------------------------------

# Center on weighted mean if required
center <- colSums(wn * x)

# Center X on the weighted mean
x_cent <- sweep(x, 2, center, check.margin = FALSE)

# Note that sweep() is subtracting the "center" from each value
all.equal(
  sweep(x, 2, center, check.margin = FALSE),
  t(apply(x, 1, function(i) i - center))
)

# Compute the weighted covariance matrix ---------------------------------------

# Weight (centered) data
x_weighted <- sqrt(wn) * x_cent

# Compute the ML weighted covariance matrix manually
covwt_man <- crossprod(x_weighted)

# Print the manual weighted covariance matrix
covwt_man

# Compute the ML weighted covariance matrix with stats::cov.wt()
covwt_stats <- cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

# Compare manual and stats weighted covariance matrices
covwt_man - covwt_stats

# Alternative computations of the unbiased weighted covariance mat -------------

# Literal translation of equation
1 / (1 - sum(wn^2)) * t(wn * x_cent) %*% (x_cent)

# Rearrange denominator
t(wn * x_cent) %*% (x_cent) / (1 - sum(wn^2))

# Spread wn
t(sqrt(wn) * x_cent) %*% (sqrt(wn) * x_cent) / (1 - sum(wn^2))

# Replace manual cross-product with R cross-product
crossprod(sqrt(wn) * x_cent) / (1 - sum(wn^2))

# Compute with stats::cov.wt()
cov.wt(xy, wt = wi, method = "unbiased", center = TRUE)$cov

# Alternative computations of the ML weighted covariance mat -------------------

# R manual cross-product using un-normalised weights
1 / sum(wi) * t(wi * x_cent) %*% (x_cent)

# Using the normalised weights
1 / sum(wn) * t(wn * x_cent) %*% (x_cent)

# Drop the term = 1
t(wn * x_cent) %*% (x_cent)

# Spread wn
t(sqrt(wn) * x_cent) %*% (sqrt(wn) * x_cent)

# Replace manual cross-product with R cross-product
crossprod(sqrt(wn) * x_cent)

# R cross-product matrix
crossprod(x_weighted)

# Compute with stats::cov.wt()
cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

# Obtain the matrix of sufficient statistics Tobs ------------------------------

# Define a new weighting object
set.seed(20220314)
wi <- runif(length(wi), min = 0, max = 1)
wn <- wi / sum(wi)

# Compute the weighted means of X again
center <- colSums(wn * x)
x_cent <- sweep(x, 2, center, check.margin = FALSE)

# "Effective" sample size
n <- sum(wi)

# Number of columns
p <- ncol(x)

# Obtain matrix of sufficient statistics (Tobs) with a loop
Tobs_loop <- matrix(0, p, p)
for (i in 1:nrow(x)) {
  Tobs_loop <- Tobs_loop + wi[i] * (x_cent[i, ]) %*% t(x_cent[i, ])
}

# Obtain matrix of sufficient statistics (Tobs) w/ cross-product shortcut
Tobs_cp <- t(wi * x_cent) %*% x_cent

# Compare loop version and cross-product shortcut
Tobs_loop - Tobs_cp

# Assign simpler name and print Tobs
(Tobs <- Tobs_cp)

# Convert to a covariance matrix
covmat <- Tobs / n

# Check it's what you were expecting
covmat - cov.wt(xy, wt = wi, method = "ML", center = TRUE)$cov

# Obtain the matrix of sufficient statistics Tobs (normalised weights) ---------

# With normalised weights, the cross-product is already the covariance matrix
covmat - t(wn * x_cent) %*% x_cent

# Then, covmat relates to Tobs as
(t(wn * x_cent) %*% x_cent * n) - Tobs_cp

# So we could say
Tobs <- t(wn * x_cent) %*% x_cent * n