Denominator when using two variables in group_by in R tidyverse

I want to calculate the mean and standard deviation of contacts for twenty types of hospital service in the two arms of a trial. So far I have done this with group_by(arm, service). This gives the average among the people who use that service in that arm. What my boss wants instead is the average for each service taken over everyone in that arm.

So, if an arm has 30 patients, only 10 of whom attend a cardiology appointment, and there are 100 cardiology contacts in total, the calculation should be 100/30 rather than 100/10. The only way I can think of doing it is to split the arms out into separate datasets, so that I would only need to group by service, which solves the problem.
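To make the two denominators concrete, the arithmetic can be written as a small R sketch (the variable names here are only for illustration and are not part of my data):

```r
total_contacts <- 100  # cardiology contacts recorded in one arm
n_arm <- 30            # every patient in that arm
n_users <- 10          # patients with at least one cardiology contact

# What group_by(arm, service) currently gives: mean among users only
mean_among_users <- total_contacts / n_users

# What is wanted: the same total spread over the whole arm
mean_over_arm <- total_contacts / n_arm
```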

An example of what this looks like:

library(tidyverse)

rep_prob <- tibble(id = 1:6, arm = c(1,1,1,0,0,0), service = c(1,1,2,1,2,2), contacts = c(21,3,14, 2,5,10)) %>% 
  group_by(arm, service) %>% 
  summarise(mean = mean(contacts), sd = sd(contacts))

Which gives results that look like this:

arm  service  mean   sd
0     1        2.0   NaN
0     2        7.5   3.535534
1     1        12.0  12.727922
1     2        14.0  NaN

Instead, I want the mean and SD of each service calculated over the arm as a whole, not over the subgroup defined by arm and service.

This is apparently very easy in Stata, and I am the only person in my department who uses R. For all my other results tables I only slice the table by one variable, so using group_by(arm) and then summarising works.

Perhaps what you are after is along the lines of:

library(tidyverse)

dat <- tibble(
    id = 1:6, 
    arm = c(1,1,1,0,0,0), 
    service = c(1,1,2,1,2,2), 
    contacts = c(21,3,14, 2,5,10)
) 

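rep_prob <- dat %>% 
    # total contacts for each arm/service combination
    group_by(arm, service) %>% 
    mutate(sum = sum(contacts)) %>%
    # divide by the number of patients in the arm
    # (equal to n() when each patient has exactly one row)
    group_by(arm) %>%
    mutate(mean = sum / n_distinct(id)) %>%
    ungroup()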

which calculates the group sums by arm and service, divided by the group sample size of each arm category. The definition of the sd would depend on how the observations are centered (i.e. how the sample mean is defined per group).
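One possible way to pin down the sd, sketched here under an assumption that is not stated in the question: treat every patient who did not use a service as having 0 contacts for it, so that both the mean and the sd are taken over everyone in the arm. The sketch also assumes at most one row per patient per service, and uses tidyr's complete() to add the missing rows; per_arm is just an illustrative name.

```r
library(dplyr)
library(tidyr)

dat <- tibble(
    id = 1:6,
    arm = c(1, 1, 1, 0, 0, 0),
    service = c(1, 1, 2, 1, 2, 2),
    contacts = c(21, 3, 14, 2, 5, 10)
)

per_arm <- dat %>%
    # add a row with 0 contacts for each patient/service pair not observed;
    # nesting(id, arm) keeps each patient in their own arm
    complete(nesting(id, arm), service, fill = list(contacts = 0)) %>%
    group_by(arm, service) %>%
    # every arm/service group now contains all patients of that arm
    summarise(mean = mean(contacts), sd = sd(contacts), .groups = "drop")
```

With the example data, arm 1 and service 1 now has the values 21, 3 and 0, so its mean is 24/3 = 8 rather than 12. Whether zero-filling is the right definition depends on how non-users should be counted in the trial analysis.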

NB: splitting dat into separate datasets by the variable arm and then grouping by service would give the same results as grouping by both arm and service directly, which is probably not what you have in mind.
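A quick way to check this for one arm, as a sketch on the example data (split_arm1 and grouped_arm1 are illustrative names only):

```r
library(dplyr)

dat <- tibble(
    id = 1:6,
    arm = c(1, 1, 1, 0, 0, 0),
    service = c(1, 1, 2, 1, 2, 2),
    contacts = c(21, 3, 14, 2, 5, 10)
)

# Option A: split out arm 1 first, then group by service only
split_arm1 <- dat %>%
    filter(arm == 1) %>%
    group_by(service) %>%
    summarise(mean = mean(contacts))

# Option B: group by both variables, then keep the arm 1 rows
grouped_arm1 <- dat %>%
    group_by(arm, service) %>%
    summarise(mean = mean(contacts), .groups = "drop") %>%
    filter(arm == 1)

# Compare the per-service means from the two routes
same_means <- isTRUE(all.equal(split_arm1$mean, grouped_arm1$mean))
```

Both routes still divide by the number of patients who used the service, so splitting alone does not change the denominator.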


Edit: if you prefer to use summarise, you could also rearrange the expressions as:

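rep_prob <- dat %>% 
   group_by(arm) %>% 
   # scale each contact count by the number of patients in the arm
   mutate(contacts_scaled = contacts / n_distinct(id)) %>%
   # .add = TRUE keeps the existing arm grouping (replaces the deprecated add =)
   group_by(service, .add = TRUE) %>%
   summarise(mean = sum(contacts_scaled)) %>%
   ungroup()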
