简体   繁体   中英

dplyr grouped cumulative set counting using group_by and rowwise do

I have grouped data with ordering within the groups where each row contains a list of values and within each group I'd like to produce a count of new list values contributed by each row to the union of the lists in each group.

Here is an example:

require(dplyr)
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"), c("A", "B"), c("A", "B"), c("A", "B", "C"))
id <- c("a", "a", "a", "b", "b", "b")
order <- c(5, 7, 3, 1, 9, 4)
testdf <- data.frame(id, order, cbind(content))
testdf
#   id order content
# 1  a     5    A, B
# 2  a     7 A, B, C
# 3  a     3    D, E
# 4  b     1    A, B
# 5  b     9    A, B
# 6  b     4 A, B, C

My desired output (after sorting by order descending within each group) would be like:

#   id order content cc
# 1  a     7 A, B, C 3
# 2  a     5    A, B 3
# 3  a     3    D, E 5
# 4  b     9    A, B 2
# 5  b     4 A, B, C 3
# 6  b     1    A, B 3

cn (cumulative new) would be preferable to cc (cumulative count) really, but the above maps to my attempt below and cn is easily calculated subsequently. Here is my attempted solution that doesn't work:

res <- testdf %>% 
  arrange(id, desc(order)) %>% 
  mutate(n=row_number()) %>%
  group_by(id) %>%
  mutate(n1=first(n)) %>%
  rowwise() %>%
  bind_cols(do(.,data.frame(vars=length(unique(unlist(testdf$content[.$n1:.$n])))))) %>%
  data.frame

I actually obtained most of that solution from here: Cumulatively paste (concatenate) values grouped by another variable (thanks akrun). The values generated seem to be correct but they are not associated with the correct rows from the source data frame:

res
#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    2
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    2
# 6  b     1    A, B 6  4    3

As you can see (looking at the vars column which is equivalent to cc above) for group 'a' values 2 and 3 are reversed and for group 'b' the second 2 and 3 values are reversed.

Actually I worked out what is wrong above , the testdf$content is (obviously) not ordered the same as the dplyr'd data frame. Originally I'd had .$content instead of testdf$content and that had produced even odder output. So I tried doing it in two stages:

res <- testdf %>% 
    arrange(id, desc(order)) %>% 
    mutate(n=row_number()) %>%
    group_by(id) %>%
    mutate(n1=first(n))
res <- res %>% 
    rowwise() %>%
    bind_cols(do(.,data.frame(vars=length(unique(unlist(res$content[.$n1:.$n])))))) %>%
    data.frame

and this produces what I expect:

#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    3
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    3
# 6  b     1    A, B 6  4    3

So my question now is is there a better way to refer to the whole dplyr-modified data frame inside the do() (so that content is ordered correctly) - I think . is just the current row isn't it? Being able to do so would avoid me having to create the ordered data frame separately before the do() .

Many thanks

Tim

You can use the Reduce function with the accumulate mode to create cumulatively distinct elements and then use lengths function to return the cumulative distinct counts, this avoids the rowwise() operation:

library(dplyr)
testdf %>% 
          arrange(desc(order)) %>% 
          group_by(id) %>% 
          mutate(cc = lengths(Reduce(function(x, y) unique(c(x, y)), content, acc = T))) %>% 
          arrange(id)

#Source: local data frame [6 x 4]
#Groups: id [2]

#      id order   content    cc
#  <fctr> <dbl>    <list> <int>
#1      a     7 <chr [3]>     3
#2      a     5 <chr [2]>     3
#3      a     3 <chr [2]>     5
#4      b     9 <chr [2]>     2
#5      b     4 <chr [3]>     3
#6      b     1 <chr [2]>     3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM