简体   繁体   English

如何用 R 确定加权人口比例的置信区间

[英]How to Determine the Confidence Interval for a WEIGHTED Population Proportion with R

TASK: COMPUTE THE WEIGHTED SHARES OF MANIFESTATIONS OF A CATEGORICAL VARIABLE AND FIND CONFIDENCE INTERVALS AROUND THOSE WEIGHTED SHARES任务:计算分类变量表现形式的加权份额,并找到这些加权份额的置信区间

library(dplyr)
set.seed(100)

Make up data set with a categorical variable and a weight variable:用分类变量和权重变量组成数据集:

df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

Have a quick look at the data快速查看数据

head(df, 10)
tail(df, 10)

Now, I CAN complete the task without taking the weights into account:现在,我可以在不考虑权重的情况下完成任务:

Write function that returns the UNWEIGHTED share of one manifestation of the categorical variable together with its 95% confidence interval (for general information regarding the determination of confidence intervals for a population proportion, see eg: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/ )编写函数,返回分类变量的一种表现形式的未加权份额及其 95% 置信区间(有关确定总体比例置信区间的一般信息,请参见例如: https : //www.dummies.com/education /math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/ )

ci.share <- function(category, manifestation){
  n = length(category)
  share = length(which(category == manifestation)) / n
  se = sqrt (share*(1-share) / n) 

  if(n*share*(1-share) >= 9){
    U <- share - 1.96*se
    O <- share + 1.96*se
    KI <- c(U, share, O)
    names(KI) <- c("lower boundary", "share", "upper boundary")
    KI = KI * 100
    return(KI)
  } else {return("Error, because n*p*(1-p) < 9")}
}

Utilize function for all manifestations and store results in list:利用所有表现形式的功能并将结果存储在列表中:

cis <- list()
for(i in c("A", "B", "C", "D")){
  cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}

Make results easily readable:使结果易于阅读:

cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis  #TASK DONE

THE QUESTION IS: HOW TO GET THE EQUIVALENT FOR "cis" CONSIDERING THE WEIGHTS问题是:考虑到权重,如何获得“cis”的等价物

What I can do, is finding the weighted shares:我能做的是找到加权份额:

ws <- summarise_at(group_by(df, Category), vars(Weight), funs(sum))
ws[,2] <- (ws[,2]/sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws

But how to get the Confidence Intervals now?但是现在如何获得置信区间?

Would be very grateful for a solution.将非常感谢解决方案。 Thanks in advance!提前致谢! Andi和我

For comparison purposes with the survey-based calculation of the proportion Confidence Intervals (CI), I here post the code to compute the CIs using the Central Limit Theorem (CLT) for non-identically distributed variables (as requested by the Original Poster (OP) in his comments):为了与基于调查的比例置信区间 (CI) 计算进行比较,我在这里发布了使用中心极限定理 (CLT) 计算非相同分布变量的 CI 的代码(根据原始海报 (OP) 的要求) ) 在他的评论中):

1) Helper functions definition 1) 辅助函数定义

#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
  design.mean = svymean(~Category, design, deff="replace")
  CI = confint( design.mean )
  # Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
  CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )

  return(as.data.frame(CI))
}

#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindenberg's version of the Central Limit Theorem (CLT) which
#' applies to independent but NOT identically distributed random variables
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
  ### The following calculations are based on the following setup for each category given in 'df':
  ### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
  ### where X is a 0/1 variable indicating absence or presence of a selected category.
  ### From the X(i) samples we would like to estimate the
  ### true proportion p of the presence of the category in the population.
  ### Therefore X(i) are iid random variables with Binomial(1,p) distribution
  ###
  ### 2) Let Y(i) = w(i)*X(i)
  ### where w(i) is the sampling weight applied to variable X(i).
  ###
  ### We apply the CLT to the sum of the Y(i)'s, using:
  ### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
  ### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
  ###
  ### Hence, by CLT:
  ###   Sum{Y(i) - mu(i)} / sigma -> N(0,1)
  ### where:
  ###   sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
  ### and note that:
  ###   Sum{ mu(i) } = Sum{ w(i) } * p = n*p
  ### since the sampling weights are assumed to sum up to the sample size.
  ###
  ### Note: all the Sums are from i = 1, ..., n
  ###
  ### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
  ### in the usual way, by first estimating sigma replacing p for the estimated p.
  ###

  alpha = 1 - ci_level                                         # area outside the confidence band
  z = qnorm(1 - alpha/2)                                       # critical z-quantile from Normal(0,1)

  n = nrow(df)                                                 # Sample size (assuming no missing values)
  ws = df$Weight / sum(df$Weight) * n                          # Weights scaled to sum the sample size (assumed for sampling weights)
  S = aggregate(ws ~ Category, sum, data=df)                   # Weighted-base estimate of the total by category (Sum{ Y(i) })
  sigma2 = sum( ws^2 )                                         # Sum of squared weights (note that we must NOT sum by category)
  S[,"p"] = S[,"ws"] / n                                       # Estimated proportion by category
  S[,"Delta"] =  z * sqrt( sigma2 ) *
                     sqrt( S$p * (1 - S$p) ) / n               # Semi-size of the CI by category
  LB_name = paste(formatC(alpha/2*100, format="g"), "%")       # Name for the CI's Lower Bound column
  UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
  S[,LB_name] = S[,"p"] - S[,"Delta"]                          # CI's Lower Bound
  S[,UB_name] = S[,"p"] + S[,"Delta"]                          # CI's Upper Bound

  return(S)
}

#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
  op = options(digits=digits)
  print(values)
  options(op)
}

2) Simulated Data 2) 模拟数据

### Simulated data
set.seed(100)
df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

3) Survey-based CI (for reference) 3)基于调查的CI(供参考)

# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)

which gives:这使:

               p  Delta  2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628  0.116
CategoryB 0.2119 0.0417 0.1702  0.254
CategoryC 0.2824 0.0445 0.2379  0.327
CategoryD 0.4161 0.0497 0.3664  0.466

4) CLT-based CI 4)基于CLT的CI

The description of the method used is included in the comments at the top of the ci_clt() function defined above.所用方法的描述包含在上面定义的ci_clt()函数顶部的注释中。

# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)

which gives:这使:

  Category    ws      p  Delta 2.5 % 97.5 %
1        A  44.8 0.0896 0.0286 0.061  0.118
2        B 106.0 0.2119 0.0410 0.171  0.253
3        C 141.2 0.2824 0.0451 0.237  0.328
4        D 208.0 0.4161 0.0494 0.367  0.465

5) Comparison of CI sizes 5)CI大小比较

Here we compute the ratio between the CLT-based CIs and the survey-based CIs.在这里,我们计算基于 CLT 的 CI 和基于调查的 CI 之间的比率。

# Comparison of CI size
show_values(
  data.frame(Category=CI_clt[,"Category"],
             ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)

which gives:这使:

  Category ratio_DeltaCI_clt2survey
1        A                    1.067
2        B                    0.982
3        C                    1.013
4        D                    0.994

Since all the ratios are close to 1, we conclude that the CI sizes from the two methods are very similar!由于所有比率都接近于 1,我们得出结论,这两种方法的 CI 大小非常相似!

6) Check that the CLT-based implementation for the CI seems to be correct 6) 检查 CI 的基于 CLT 的实现是否正确

An convenient check to perform on the CLT-based calculation of the CIs is to run the calculation on the Simple Random Sample (SRS) case and verify that the results coincide with those given by the svymean() calculation under the SRS design.对基于 CLT 的 CI 计算执行的一种方便检查是在简单随机样本 (SRS) 情况下运行计算,并验证结果与 SRS 设计下svymean()计算给出的结果一致。

# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1                               # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )

which gives:这使:

  Category   p  Delta  2.5 % 97.5 %
1        A 0.1 0.0263 0.0737  0.126
2        B 0.2 0.0351 0.1649  0.235
3        C 0.3 0.0402 0.2598  0.340
4        D 0.4 0.0429 0.3571  0.443

which we compare to the survey-based calculation:我们将其与基于调查的计算进行比较:

# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )

that gives:这给出了:

            p  Delta  2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737  0.126
CategoryB 0.2 0.0351 0.1649  0.235
CategoryC 0.3 0.0402 0.2598  0.340
CategoryD 0.4 0.0430 0.3570  0.443

We see that the results coincide, suggesting that the implementation of the CLT-based calculation of the CI in the general case with unequal weights is correct.我们看到结果一致,表明在权重不等的一般情况下,基于 CLT 的 CI 计算的实现是正确的。

I finally found a very easy solution for my problem.我终于为我的问题找到了一个非常简单的解决方案。 The package "survey" does just what I want:包“调查”正是我想要的:

set.seed(100)
library(survey)

df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

d <- svydesign(id = ~1, weights = ~Weight, data = df)
svymean(~Category, d)

The code results in the weighted proportions (same results as in "ws") and the corresponding standard errors, which makes it easy to calculate the confidence intervals.代码产生加权比例(与“ws”中的结果相同)和相应的标准误差,这使得计算置信区间变得容易。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM