
Understanding dplyr and group_by

I've been using dplyr in my workflows for quite some time, but I'm coming to realize that I may not understand the group_by function. Can someone explain whether there is a better approach to accomplishing my goals?

My initial understanding was that placing group_by() before an operation such as mutate makes mutate operate discretely within each group, restarting its operation for each Condition specified in group_by().

This doesn't seem to be true. I've had to resort to splitting my data tables into lists by the Condition that I had previously passed to group_by(), performing my intended functions with lapply, and then collapsing the list back into a matrix.

Example below. My intention was to perform a cumsum operation on column TVC within each Condition. However, you'll see that the summation column is a straightforward cumsum across the whole TVC column, with no restart between the groups defined by the Condition column.

> (data %>% filter(`Elapsed Time (days)`<=8) %>%
+   arrange(Condition,`Elapsed Time (days)`) %>%
+   select(Condition, `Elapsed Time (days)`, TVC) %>%
+   filter(!is.na(TVC)) %>%
+   group_by(Condition) %>%
+   mutate(summation =cumsum(TVC)))
# A tibble: 94 x 4
# Groups:   Condition [24]
   Condition `Elapsed Time (days)`       TVC  summation
   <chr>     <drtn>                    <dbl>      <dbl>
 1 1A        0.000000 secs         15400921.  15400921.
 2 1A        4.948611 secs         11877256.  27278177 
 3 1A        6.027778 secs         11669731.  38947908.
 4 1A        6.949306 secs         11908853.  50856761.
 5 1B        0.000000 secs         14514263.  65371024.
 6 1B        4.948611 secs          8829356.  74200380.
 7 1B        6.027778 secs         12068221.  86268601.
 8 1B        6.949306 secs         10111424.  96380026.
 9 1C        0.000000 secs         15400921. 111780946.
10 1C        4.948611 secs          8680060  120461006.
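
For reference, the split-and-recombine workaround described in the question can be sketched in base R roughly as follows. The data frame `toy` and its values are made up for illustration and are not the poster's data; only the column names Condition and TVC come from the question.

```r
# Toy data standing in for the poster's table (values are hypothetical).
toy <- data.frame(
  Condition = c("1A", "1A", "1B", "1B"),
  TVC       = c(10, 20, 5, 7)
)

# Split into one data frame per Condition, add a running total to each
# piece, then bind the pieces back together.
pieces <- split(toy, toy$Condition)
pieces <- lapply(pieces, function(d) {
  d$summation <- cumsum(d$TVC)
  d
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL  # drop the "1A.1"-style row names added by rbind

# A shorter base R equivalent: ave() applies cumsum within each group.
toy$summation <- ave(toy$TVC, toy$Condition, FUN = cumsum)
```

Both `result$summation` and `toy$summation` restart at each new Condition, which is the behaviour the question expected from group_by() followed by mutate().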

I would try this before your code chunk:

df$Condition <- as.factor(df$Condition)

I think group_by works best with factors. It is supposed to work with characters as well, but in my experience factors cause fewer bugs. I don't know whether others have this issue.

After that, do this, as Karthik suggests:

df %>% group_by(Condition) %>% mutate(summation = cumsum(TVC))
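
As a minimal check of the expected behaviour, the sketch below runs a grouped cumsum on a made-up tibble (`toy` and its values are hypothetical, and Condition is left as a character column). It calls mutate as `dplyr::mutate` on purpose: one commonly reported cause of a grouped mutate() ignoring its groups is another package, such as plyr, being loaded after dplyr and masking mutate(). Whether that applies to the session in the question is only an assumption, since the question does not show which packages were loaded.

```r
library(dplyr)

# Hypothetical toy data; Condition is a plain character column.
toy <- tibble(
  Condition = c("1A", "1A", "1B", "1B"),
  TVC       = c(10, 20, 5, 7)
)

out <- toy %>%
  group_by(Condition) %>%
  # Namespace-qualified so that a mutate() from another attached package
  # cannot be picked up by accident.
  dplyr::mutate(summation = cumsum(TVC)) %>%
  ungroup()

# Inspect which packages are attached and in what order; a package listed
# before dplyr here masks dplyr functions of the same name.
search()
```

If `out$summation` restarts at each Condition here but not in the original pipeline, comparing the output of `search()` between the two sessions is a reasonable next step.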
