How to Perform a Statistical Two-Sided Test for Independence (on Proportions) in R?
I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi-Square in which I supply a dataset column for a numerator, another column for a denominator, and a categorical variable to distinguish distributions (A/B).
However, I am getting unexpected values in R using some example sets. When I test two similar populations with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~1).
My test set is below, where I went with sugar content in a batch of water, e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar: it isn't a pass-fail type test, and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
library(dplyr)   # %>%, group_by(), summarize(), mutate(), glimpse()
library(tibble)  # tibble()
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then summarize to calculate the overall sugar percentage for each group:
# SUMMARY TOTALS
totals <- batches %>%
  group_by(group) %>%
  summarize(batch_count = n(),
            batch_lbs_sum = sum(batch_lbs),
            sugar_lbs_sum = sum(sugar_lbs),
            sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
  glimpse()
I then supply the sugar percentage of each group to prop.test, expecting a high p-value:
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
  rowwise() %>%
  summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
  mutate(p_val = round(p_val, digits = 3)) %>%
  mutate(conf_level = 1 - p_val) %>%
  select(p_val, conf_level) %>%
  glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
  glimpse()
Unfortunately the final table gives me a p-value of 0, suggesting the two nearly identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem: I was expecting a low p-value, since I am using the recommended sample size, but instead I get a p-value of ~1.
# COMPARE POWER.PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEST
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong, or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS. Is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop.test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, where prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
  mutate(city_total = washed + not_washed,
         city_percent = washed / city_total) %>%
  select(-washed, -not_washed) %>%
  glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
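As a cross-check on the two calculations above, here is a minimal base-R sketch that rebuilds the 4 x 2 table of washed / not-washed counts and passes it to both chisq.test and prop.test. The object names (washed, not_washed, counts) are just illustrative; the numbers are the ones from the example.

```r
# Counts from the hand-washing example above
washed     <- c(Atlanta = 1175, Chicago = 1329, NY = 1169, SF = 1521)
not_washed <- c(Atlanta = 413,  Chicago = 180,  NY = 334,  SF = 215)

# Two-column matrix: successes in column 1, failures in column 2
counts <- cbind(washed, not_washed)

chi <- chisq.test(counts)   # Pearson chi-squared test on the 4 x 2 table
pt  <- prop.test(counts)    # n is ignored when x is a matrix

chi$statistic   # should be close to the 161.74 used in the pchisq() call
chi$parameter   # degrees of freedom: (4 - 1) * (2 - 1) = 3
c(chisq = chi$p.value, prop = pt$p.value)
```

With more than two groups neither function applies a continuity correction, so the two p-values should agree.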
The documentation for prop.test says:
Usage
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, correct = TRUE)
Arguments
x
a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively.
n
a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
But from a statistical perspective this is completely the wrong way of doing things. By arbitrarily dividing two quantities by their sum, you directly induce spurious correlation in a test of the difference between them. If the samples (sugar_lbs_sum) are independent, but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal 1993 "Spurious correlation and the fallacy of the ratio" covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.您生成的数据是独立正态的,因此不要将它们相加,而是测试与 t 检验的差异。
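A minimal sketch of that suggestion, assuming the per-batch sugar weights are what should be compared (the answer does not say which variable to test). The data are regenerated in base R with the same seed and parameters as in the question.

```r
set.seed(108)

# Same generating process as the question, using data.frame instead of tibble
group_A <- data.frame(group = "A",
                      sugar_lbs = rnorm(n = 50, mean = 10, sd = 3),
                      batch_lbs = rnorm(n = 50, mean = 30, sd = 6))
group_B <- data.frame(group = "B",
                      sugar_lbs = rnorm(n = 50, mean = 10, sd = 3),
                      batch_lbs = rnorm(n = 50, mean = 30, sd = 6))
batches <- rbind(group_A, group_B)

# Welch two-sample t-test (the t.test default) on the unsummed observations
tt <- t.test(sugar_lbs ~ group, data = batches)
tt$p.value
```

This works on the 50 observations per group directly, so the test sees the batch-to-batch variability that disappears once the weights are summed.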
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group, and washed counts to sugar_lbs. I also doubled one batch (akin to comparing a small versus a large city). I now get the expected high p-value (0.65), indicating that there is no statistically significant difference between the proportions.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero, indicating one of the groups is different, as expected.
Finally, when doing the prop.test within a 'dplyr' pipe, I found I should not have used the rowwise() step, which causes my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't yet know which group is different until I iteratively compare only 2 groups at a time.
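For the two-at-a-time comparison, base R's stats package has pairwise.prop.test, which runs prop.test on every pair of groups and adjusts the p-values for multiple comparisons. Below is a sketch on the hand-washing counts from the Stanford example, since those are true counts. Whether it suits weight-based data like the sugar batches is a separate question that this sketch does not settle.

```r
# Counts of successes and trials per city, from the Stanford example
washed <- c(Atlanta = 1175, Chicago = 1329, NY = 1169, SF = 1521)
total  <- washed + c(413, 180, 334, 215)

# All pairwise two-group proportion tests, Holm-adjusted (the default method)
pw <- pairwise.prop.test(x = washed, n = total, p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, labelled by city
pw$p.value
```

Small entries in the matrix point to the pairs of groups whose proportions differ.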
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#--------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
  mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
         water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
  mutate(batch_lbs = sugar_lbs + water_lbs,
         sugar_percent = sugar_lbs / batch_lbs) %>%
  glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
  mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
         water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
  mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
         water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
  glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = FALSE)
# PROP TEST VERSION (SIMILAR RESULT; WITH 2 GROUPS prop.test ALSO APPLIES A CONTINUITY CORRECTION BY DEFAULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value