Calculating 95% confidence intervals for a weighted median over grouped data in dplyr

I have a dataset with several groups, where I want to calculate a median value for each group using dplyr. The data are weighted, and the weights need to be taken into account when calculating the median. The weighted.median function from spatstat seems to work fine for this. Consider the following simplified example:

library(spatstat)
library(dplyr)

tst <- data.frame(group = rep(c(1:5), each = 100))
tst$val = runif(500) * tst$group
tst$wt = runif(500) * tst$val

tst %>%
  group_by(group) %>%
  summarise(weighted.median(val, wt))

# A tibble: 5 × 2
  group `weighted.median(val, wt)`
  <int>                      <dbl>
1     1                      0.752
2     2                      1.36 
3     3                      1.99 
4     4                      2.86 
5     5                      3.45 

However, I would also like to add 95% confidence intervals to these values, and this has me stumped. Things I've considered:

  • Spatstat also has a weighted.var function, but it is undocumented, and it's not even clear to me whether this is variance around the median or around the mean.
  • This rcompanion post suggests various methods for calculating CIs around medians, but as far as I can tell none of them handle weights.
  • This blog post suggests a function for calculating CIs and a median for weighted data, and is the closest I can find to what I need. However, it doesn't work with my dplyr groupings. I suppose I could write a loop to handle one group at a time and build up the output data frame, but that seems cumbersome. I'm also not totally sure I understand the function in the post, and I'm slightly suspicious of its results: for instance, testing it out, I get wider estimates for alpha=0.1 than for alpha=0.05, which seems backwards to me. Edit to add: upon further investigation, I think this function works as intended if I use alpha=0.95 for 95% CIs, rather than alpha=0.05 (at least, this returns values that feel intuitively about right). I can also make it work with dplyr by editing it to return just a single moe value rather than a pair of high/low estimates. So this may be a good option, but I'm also considering others.

Is there an existing function in some library that can do what I want, or an otherwise straightforward way to implement this?

There are several approaches.

You could use the asymptotic formula for the standard error of the sample median. The sample median is asymptotically normal with standard error 1/sqrt(4 n f(m)^2), where n is the number of observations, m is the true median, and f(x) is the probability density of the (weighted) random variable. You could estimate the probability density using the base R function density.default with its weights argument. If x is the vector of observed values and w the corresponding vector of weights, then

med <- weighted.median(x, w)
# density.default expects weights that sum to 1
f <- density(x, weights = w / sum(w))
# interpolate the density estimate at the median
fmed <- approx(f$x, f$y, xout = med)$y
samplesize <- length(x)
se <- 1/sqrt(4 * samplesize * fmed^2)
ci <- med + c(-1, 1) * 1.96 * se

This relies on several asymptotic approximations, so it may be inaccurate. The sample size also depends on the interpretation of the weights; in some cases the sample size could be equal to sum(w).
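To apply this per group with dplyr, the steps above can be wrapped in a small helper function and called inside summarise. A minimal sketch, assuming the example data from the question; the helper name weighted_median_ci is my own, not from any package:

```r
library(spatstat)
library(dplyr)

# Hypothetical helper: asymptotic normal CI for a weighted median
weighted_median_ci <- function(x, w, conf = 0.95) {
  med <- weighted.median(x, w)
  f <- density(x, weights = w / sum(w))       # weights must sum to 1
  fmed <- approx(f$x, f$y, xout = med)$y      # density at the median
  se <- 1 / sqrt(4 * length(x) * fmed^2)
  z <- qnorm(1 - (1 - conf) / 2)
  tibble(median = med, lo = med - z * se, hi = med + z * se)
}

tst %>%
  group_by(group) %>%
  summarise(weighted_median_ci(val, wt), .groups = "drop")
```

In dplyr 1.0 or later, summarise unpacks the one-row tibble returned per group into separate median, lo, and hi columns.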

If there is very little data in each group, you could use the even simpler normal reference approximation:

med <- weighted.median(x, w)
v <- weighted.var(x, w)
sdm <- sqrt(pi/2) * sqrt(v)
samplesize <- length(x)
se <- sdm/sqrt(samplesize)
ci <- med + c(-1,1) * 1.96 * se

Alternatively you could use bootstrapping: generate random resamples of the input data (by choosing random resamples of the indices 1, 2, ..., n), extract the corresponding weighted observations (x_i, w_i), compute the weighted median of each resampled dataset, and take quantiles of those medians to construct the 95% confidence interval. (This approach implicitly assumes the sample size is equal to n.)
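A minimal sketch of that bootstrap for a single group; the number of resamples B = 999 and the toy data are arbitrary choices of mine:

```r
library(spatstat)

set.seed(1)
x <- runif(100)
w <- runif(100)

B <- 999  # number of bootstrap resamples (arbitrary choice)
boot_meds <- replicate(B, {
  idx <- sample(length(x), replace = TRUE)  # resample indices 1..n
  weighted.median(x[idx], w[idx])           # weighted median of the resample
})

# percentile 95% confidence interval
ci <- quantile(boot_meds, c(0.025, 0.975))
```

The same resampling could be run per group, e.g. inside group_modify, if you need it within a dplyr pipeline.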
