
dplyr grouped cumulative set counting using group_by and rowwise do

I have grouped data, ordered within each group, where each row contains a list of values. Within each group, I'd like to count the new values that each row contributes to the running union of the lists in that group.

Here is an example:

library(dplyr)
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"), c("A", "B"), c("A", "B"), c("A", "B", "C"))
id <- c("a", "a", "a", "b", "b", "b")
order <- c(5, 7, 3, 1, 9, 4)
testdf <- data.frame(id, order, cbind(content))
testdf
#   id order content
# 1  a     5    A, B
# 2  a     7 A, B, C
# 3  a     3    D, E
# 4  b     1    A, B
# 5  b     9    A, B
# 6  b     4 A, B, C

My desired output (after sorting by order, descending, within each group) would look like this:

#   id order content cc
# 1  a     7 A, B, C 3
# 2  a     5    A, B 3
# 3  a     3    D, E 5
# 4  b     9    A, B 2
# 5  b     4 A, B, C 3
# 6  b     1    A, B 3

cn (cumulative new) would really be preferable to cc (cumulative count), but cc maps onto my attempt below, and cn is easily calculated from it afterwards. Here is my attempted solution, which doesn't work:

res <- testdf %>% 
  arrange(id, desc(order)) %>% 
  mutate(n=row_number()) %>%
  group_by(id) %>%
  mutate(n1=first(n)) %>%
  rowwise() %>%
  bind_cols(do(.,data.frame(vars=length(unique(unlist(testdf$content[.$n1:.$n])))))) %>%
  data.frame

I obtained most of that solution from here: Cumulatively paste (concatenate) values grouped by another variable (thanks akrun). The values generated seem to be correct, but they are not associated with the correct rows of the source data frame:

res
#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    2
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    2
# 6  b     1    A, B 6  4    3

As you can see from the vars column (which is equivalent to cc above), group 'a' shows 2, 3, 5 where I expected 3, 3, 5, and group 'b' shows 2, 2, 3 where I expected 2, 3, 3.

Actually, I worked out what is wrong above: testdf$content is (obviously) not ordered the same way as the dplyr'd data frame. Originally I had .$content instead of testdf$content, and that produced even odder output. So I tried doing it in two stages:

res <- testdf %>% 
    arrange(id, desc(order)) %>% 
    mutate(n=row_number()) %>%
    group_by(id) %>%
    mutate(n1=first(n))
res <- res %>% 
    rowwise() %>%
    bind_cols(do(.,data.frame(vars=length(unique(unlist(res$content[.$n1:.$n])))))) %>%
    data.frame

and this produces what I expect:

#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    3
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    3
# 6  b     1    A, B 6  4    3
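To make the mismatch concrete, here is a minimal base R sketch (not part of my original attempt) that reproduces both sets of counts without dplyr. The names ord, idx, sorted_content and count_union are made up for this illustration. It assumes the only difference between the two attempts is which list the positions n1:n are applied to: the original, unsorted content or the sorted one.

```r
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"),
                c("A", "B"), c("A", "B"), c("A", "B", "C"))
id  <- c("a", "a", "a", "b", "b", "b")
ord <- c(5, 7, 3, 1, 9, 4)

# Row positions after sorting by id, then by ord descending
idx <- order(id, -ord)
sorted_content <- content[idx]
sorted_id <- id[idx]

# n is the row position in the sorted data, n1 the first position of its group
n  <- seq_along(sorted_id)
n1 <- ave(n, sorted_id, FUN = min)

# Hypothetical helper: size of the union of lst[from], ..., lst[to]
count_union <- function(lst, from, to) length(unique(unlist(lst[from:to])))

# Sorted positions applied to the UNSORTED list (what testdf$content does)
wrong <- mapply(function(a, b) count_union(content, a, b), n1, n)
# Sorted positions applied to the SORTED list (what the two-stage version does)
right <- mapply(function(a, b) count_union(sorted_content, a, b), n1, n)

wrong  # 2 3 5 2 2 3
right  # 3 3 5 2 3 3
```

The first vector matches the misaligned vars column and the second matches the expected one, which is consistent with the pipe never modifying testdf itself.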

So my question now is: is there a better way to refer to the whole dplyr-modified data frame inside the do(), so that content is ordered correctly? I think . is just the current row, isn't it? Being able to do so would save me from having to create the ordered data frame separately before the do().

Many thanks

Tim

You can use the Reduce function with accumulate = TRUE to build the cumulative sets of distinct elements, and then use the lengths function to return the cumulative distinct counts. This avoids the rowwise() operation:

library(dplyr)
testdf %>%
  arrange(desc(order)) %>%
  group_by(id) %>%
  # Reduce(..., accumulate = TRUE) returns the running union after each row
  mutate(cc = lengths(Reduce(function(x, y) unique(c(x, y)), content, accumulate = TRUE))) %>%
  arrange(id)

#Source: local data frame [6 x 4]
#Groups: id [2]

#      id order   content    cc
#  <fctr> <dbl>    <list> <int>
#1      a     7 <chr [3]>     3
#2      a     5 <chr [2]>     3
#3      a     3 <chr [2]>     5
#4      b     9 <chr [2]>     2
#5      b     4 <chr [3]>     3
#6      b     1 <chr [2]>     3
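For illustration, the Reduce step can be run on its own for group 'a' in its sorted order. This is a minimal base R sketch; content_a, running, cc and cn are names chosen here, not part of the pipeline above. It also derives the cn column mentioned in the question, assuming cn means the number of values each row adds to the running union.

```r
# Group 'a' after sorting by order descending (7, 5, 3)
content_a <- list(c("A", "B", "C"), c("A", "B"), c("D", "E"))

# accumulate = TRUE keeps every intermediate result, so element i is the
# union of the first i vectors
running <- Reduce(function(x, y) unique(c(x, y)), content_a, accumulate = TRUE)

# Cumulative distinct count per row
cc <- lengths(running)
cc  # 3 3 5

# Values newly contributed by each row: difference between successive counts
cn <- diff(c(0L, cc))
cn  # 3 0 2
```

One caveat: without an init argument, Reduce returns the first element unchanged, so this relies on each row's vector containing no duplicates. If duplicates within a row are possible, applying unique to each vector first (for example with lapply(content_a, unique)) should guard against that.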
