How to cumsum in R based on certain fields?
I apologize for the length, but the detail is needed so that I do not skip anything and make this more confusing than it already is.
Below is the sample data and some of the manipulations that I have done so far.
library(dplyr)
library(tidyverse)
emp <- c(1,2,3,4,5,6,7,8,1,12,54,101,33,159,201,261,110,195,131,228)
small <- c(1,1,1,1,1,1,1,1,1,1,1,2,1,3,3,4,2,3,2,3)
area <-c(003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003)
twodigit <-c(11,22,11,22,23,22,11,31,44,45,21,44,45,62,72,22,45,72,45,21)
smbtest2 <- data.frame(emp,small,area,twodigit)
Before I get too far: the goal is to sum the employment (emp) by small (schema below) and then break it down by twodigit (industry code). In this simple example, I want the top 3 industries for every small category. I am trying to cumsum because anything that is in the first category (0 to 99) is also in the second category (0 to 149).
smbsummary3<-smbtest2 %>%
group_by(area,small,twodigit) %>%
summarise(emp = sum(emp), worksites = n(),
.groups = 'drop_last')%>%
slice_max(emp,n=3)
smbsummary4<-smbsummary3 %>%
ungroup %>%
complete(area, small = unique(small)) %>%
fill(emp, worksites)
Schema for small
1 0 to 99
2 0 to 149
3 0 to 249
4 0 to 499
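For reference, the small codes in the sample data are consistent with taking the narrowest band that contains each emp value. The following is only a sketch: it assumes each band's upper limit is inclusive (99, 149, 249, 499), and the name small_check is illustrative rather than part of the original code.

```r
emp   <- c(1,2,3,4,5,6,7,8,1,12,54,101,33,159,201,261,110,195,131,228)
small <- c(1,1,1,1,1,1,1,1,1,1,1,2,1,3,3,4,2,3,2,3)

# Assumption: upper limits are inclusive, so an emp of 99 would fall in band 1.
# labels = FALSE makes cut() return the integer band number.
small_check <- cut(emp, breaks = c(-Inf, 99, 149, 249, 499), labels = FALSE)

# Compare the derived bands with the hand-entered small column
all(small_check == small)
```

The bands are nested, so a worksite coded as small 1 also belongs to bands 2, 3 and 4, which is why a running total across bands is wanted.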
Desired Result
area small twodigit emp worksites
003 1 21 54 1
003 1 45 45 2 (12+33)
003 1 22 12 3 (2+4+6)
003 2 45 286 4 (12+33+110+131)
003 2 44 102 2 (1+101)
003 2 21 54 1
At present, it sums purely within each small category, which is what the code tells it to do. However, my question is: how do I get a cumulative sum (cumsum) based on the small category?
Below is my latest attempt. It does not add up to the correct answer, but I think it is close to being the correct set of commands.
smbsummary3<-smbtest2 %>%
group_by(area,small,twodigit) %>%
summarise(emp = sum(emp), worksites = n(),
.groups = 'drop_last')%>%
mutate(emp = cumsum(emp),
worksites = cumsum(worksites))%>%
slice_max(emp,n=3)
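One way to see why this attempt falls short is to check which grouping is still active when cumsum runs. The sketch below rebuilds the sample data and inspects the result of summarise() with dplyr's group_vars(); the name step1 is only illustrative.

```r
library(dplyr)

emp <- c(1,2,3,4,5,6,7,8,1,12,54,101,33,159,201,261,110,195,131,228)
small <- c(1,1,1,1,1,1,1,1,1,1,1,2,1,3,3,4,2,3,2,3)
area <- rep(3, 20)
twodigit <- c(11,22,11,22,23,22,11,31,44,45,21,44,45,62,72,22,45,72,45,21)
smbtest2 <- data.frame(emp, small, area, twodigit)

# Same summarise step as in the attempt above
step1 <- smbtest2 %>%
  group_by(area, small, twodigit) %>%
  summarise(emp = sum(emp), worksites = n(), .groups = "drop_last")

# "drop_last" removes only the last grouping variable (twodigit)
group_vars(step1)
```

Since the data is still grouped by area and small at that point, cumsum accumulates across the industries inside one small category rather than across the categories.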
I was going to try to explain in a comment, but this seemed easier.
In your attempt, .groups = 'drop_last' leaves the data grouped by area and small, so cumsum runs across the industries within each small category rather than across categories. Maybe you want to group_by just area and twodigit before doing your cumulative sum. Then, group_by again to select the top 3 emp values by area and small. The resulting output looks very similar to your desired result (I could not find small 2 and twodigit 21 in the dataset).
smbtest2 %>%
group_by(area, small, twodigit) %>%
summarise(emp = sum(emp),
worksites = n(),
.groups = 'drop_last') %>%
group_by(area, twodigit) %>%
mutate(emp = cumsum(emp),
worksites = cumsum(worksites)) %>%
group_by(area, small) %>%
slice_max(emp, n = 3) %>%
arrange(area, small, desc(emp))
Output
area small twodigit emp worksites
<dbl> <dbl> <dbl> <dbl> <int>
1 3 1 21 54 1
2 3 1 45 45 2
3 3 1 22 12 3
4 3 2 45 286 4
5 3 2 44 102 2
6 3 3 72 396 2
7 3 3 21 282 2
8 3 3 62 159 1
9 3 4 22 273 4
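If industries that are absent from a size category should still carry their running total forward (as with small 2 and twodigit 21 in the desired result), one option is to add zero rows for the missing combinations before taking the cumulative sum. The following is a sketch using tidyr's complete(), not part of the answer above; the name smb_cumulative is illustrative, and with_ties = FALSE is an assumption made to cap each category at exactly three rows.

```r
library(dplyr)
library(tidyr)

emp <- c(1,2,3,4,5,6,7,8,1,12,54,101,33,159,201,261,110,195,131,228)
small <- c(1,1,1,1,1,1,1,1,1,1,1,2,1,3,3,4,2,3,2,3)
area <- rep(3, 20)
twodigit <- c(11,22,11,22,23,22,11,31,44,45,21,44,45,62,72,22,45,72,45,21)
smbtest2 <- data.frame(emp, small, area, twodigit)

smb_cumulative <- smbtest2 %>%
  group_by(area, small, twodigit) %>%
  summarise(emp = sum(emp), worksites = n(), .groups = "drop") %>%
  # Add a zero row for every industry missing from a size category
  complete(area, small, twodigit, fill = list(emp = 0, worksites = 0L)) %>%
  # Order by small so the running total moves from band 1 upwards
  arrange(area, twodigit, small) %>%
  group_by(area, twodigit) %>%
  mutate(emp = cumsum(emp), worksites = cumsum(worksites)) %>%
  # Top 3 industries per size category
  group_by(area, small) %>%
  slice_max(emp, n = 3, with_ties = FALSE) %>%
  arrange(area, small, desc(emp)) %>%
  ungroup()

smb_cumulative
```

On the sample data this gives twodigit 45 (emp 286), 44 (emp 102) and 21 (emp 54) for small 2, which matches the desired result. The trade-off is that every category now considers every industry, so small 3 and small 4 also rank industries that have no worksites of their own in those bands.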