Counts of grouped variables using dplyr

I would like to end up with a data frame containing confidence intervals for proportions. I have introduced a variable (tp in my example) as a cut-off indicator to calculate the proportions for, and I would like to use the dplyr package to produce the final data frame. Below is a simplified example:

library(dplyr)

my_names <- c("A", "B")
dt <- data.frame(
  Z = sample(my_names, 100, replace = TRUE),
  X = sample(1:10, 100, replace = TRUE),  # draw 100 values, matching Z and Y
  Y = sample(c(0, 1), 100, replace = TRUE)
)

my.df <- dt %>%
  mutate(tp = (X > 8) * 1) %>%  # multiply by one to convert into numeric
  group_by(Z, tp) %>%
  summarise(n = n()) %>%
  mutate(prop.tp = n / sum(n)) %>%
  mutate(SE.tp = sqrt((prop.tp * (1 - prop.tp)) / n)) %>%
  mutate(Lower_limit = prop.tp - 1.96 * SE.tp) %>%
  mutate(Upper_limit = prop.tp + 1.96 * SE.tp)

output:

Source: local data frame [4 x 7]
Groups: Z

  Z tp  n   prop.tp      SE.tp Lower_limit Upper_limit
1 A  0 33 0.6346154 0.08382498   0.4703184   0.7989123
2 A  1 19 0.3653846 0.11047236   0.1488588   0.5819104
3 B  0 27 0.5625000 0.09547033   0.3753782   0.7496218
4 B  1 21 0.4375000 0.10825318   0.2253238   0.6496762

However, I would like to calculate the standard error and the CIs using the total sample for each group in column Z, not the sample split by the categorical variable tp. So the total sample for A in my example should be n = 33 + 19. Any ideas?
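To make the request concrete, here is a small hand calculation for group A using the counts from the output above. It is only a sketch of the intended arithmetic: the variable names (n_tp0, n_tp1, n_A and so on) are made up for illustration, and the interval is the same normal-approximation interval with 1.96 used in the pipeline.

```r
# Counts for group A, taken from the output above
n_tp0 <- 33
n_tp1 <- 19
n_A   <- n_tp0 + n_tp1   # total sample for A (33 + 19)

# Proportion of tp == 1 within A, with the SE based on the group total
prop_tp1 <- n_tp1 / n_A
se_tp1   <- sqrt(prop_tp1 * (1 - prop_tp1) / n_A)

# Normal-approximation 95% interval, as in the pipeline above
ci_tp1 <- c(lower = prop_tp1 - 1.96 * se_tp1,
            upper = prop_tp1 + 1.96 * se_tp1)

print(round(c(prop = prop_tp1, SE = se_tp1, ci_tp1), 4))
```

Because the denominator is the group total rather than the 19 rows with tp == 1, the standard error comes out smaller than the SE.tp value shown for that row.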

I'm not quite sure which group you want to compare with which here, but at any rate you have two grouping variables, tp = X > 8 and Z. If you want to compare the rows with X > 8 and Z == "A" to all rows with X > 8, you can do it like this:

merge(
    dt %>%
        group_by(X > 8) %>%
        summarize(n.X = n()),
    dt %>%
        group_by(X > 8, Z) %>%
        summarise(n.XZ = n()),
    by = "X > 8"
) %>%
    mutate(prop.XZ = n.XZ/n.X) %>%
    mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.X))%>%
    mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
    mutate(Upper_limit = prop.XZ+1.96 * SE)
  X > 8 n.X Z n.XZ   prop.XZ         SE Lower_limit Upper_limit
1 FALSE  70 A   37 0.5285714 0.05966378   0.4116304   0.6455124
2 FALSE  70 B   33 0.4714286 0.05966378   0.3544876   0.5883696
3  TRUE  30 A   16 0.5333333 0.09108401   0.3548087   0.7118580
4  TRUE  30 B   14 0.4666667 0.09108401   0.2881420   0.6451913

If you want to turn the problem around and compare the rows with X > 8 and Z == "A" to all rows with Z == "A", you can do it like this:

merge(
    dt %>%
        group_by(Z) %>%
        summarize(n.Z = n()),
    dt %>%
        group_by(X > 8, Z) %>%
        summarise(n.XZ = n()),
    by = "Z"
) %>%
    mutate(prop.XZ = n.XZ/n.Z) %>%
    mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.Z))%>%
    mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
    mutate(Upper_limit = prop.XZ+1.96 * SE)
  Z n.Z X > 8 n.XZ   prop.XZ         SE Lower_limit Upper_limit
1 A  53 FALSE   37 0.6981132 0.06305900   0.5745176   0.8217088
2 A  53  TRUE   16 0.3018868 0.06305900   0.1782912   0.4254824
3 B  47 FALSE   33 0.7021277 0.06670743   0.5713811   0.8328742
4 B  47  TRUE   14 0.2978723 0.06670743   0.1671258   0.4286189

It is a bit messy to have to merge two separate groupings, but I don't know whether it is possible to ungroup and re-group in the same statement. I am surprised, though, at how difficult it seems to be to use groupings on two different levels (if you can call it that), and I hope someone else can come up with a better solution.
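For what it's worth, dplyr does let you call group_by() and ungroup() more than once within a single pipeline, so the merge can probably be avoided. The following is a minimal sketch under a few assumptions: it uses count() with its name argument (available in reasonably recent dplyr versions), a fixed seed that is not in the original, and the column names from the merge version above.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dt <- data.frame(
  Z = sample(c("A", "B"), 100, replace = TRUE),
  X = sample(1:10, 100, replace = TRUE)
)

res <- dt %>%
  mutate(tp = as.integer(X > 8)) %>%
  count(Z, tp, name = "n.XZ") %>%  # one row per Z/tp combination, ungrouped
  group_by(Z) %>%                  # re-group on Z only
  mutate(
    n.Z         = sum(n.XZ),       # total sample for each Z group
    prop.XZ     = n.XZ / n.Z,
    SE          = sqrt(prop.XZ * (1 - prop.XZ) / n.Z),
    Lower_limit = prop.XZ - 1.96 * SE,
    Upper_limit = prop.XZ + 1.96 * SE
  ) %>%
  ungroup()

print(res)
```

Replacing group_by(Z) with group_by(tp) should give the other comparison, where the denominator is all rows with the same value of X > 8. Note also that summarise() on a data frame grouped by two variables drops only the last grouping level, so in the question's own pipeline sum(n) is already the total for each Z group.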
