
Why does tidyverse group_by behave unexpectedly after an R update?

This code used to work fine, but after I updated R, the group_by function seems to treat every row as its own group. In the example dataset dtt below, if I filter the data down to a single group and run the code, it works as expected. If I run the same code on all groups, it does not.

The working and non-working code are shown here, followed by the data.

# Filter dtt to a single group (x, y) and run the code: this works as expected

dtt_xy<-dtt%>%
        filter(x==-121 & y == 65)
dtt_xy

dtt_output <- dtt_xy%>%
  group_by(x, y) %>%
  group_by(grp = cumsum(c(TRUE, diff(Date) != 1)), .add = TRUE)
dtt_output #Expected output

# Now run the same code on the whole dataset dtt: this does not work

dtt_output <- dtt%>%
  group_by(x, y) %>%
  group_by(grp = cumsum(c(TRUE, diff(Date) != 1)), .add = TRUE)
dtt_output # Not the expected output; the expectation is 35 groups

Sample Data

dtt<-structure(list(x = c(-121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120, 
-121, -120, -121, -121, -121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120, 
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120, 
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120, 
-121, -120, -120, -120, -121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120, 
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120, 
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121, 
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120
), y = c(65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 
65, 63, 63, 65, 65, 63, 63, 65, 63, 65, 65, 63, 63, 65, 65, 63, 
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 65, 65, 65, 65, 65, 
65, 63, 63, 65, 65, 63, 63, 63, 63, 65, 63, 63, 63, 65, 65, 63, 
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 63, 63, 65, 
65, 65, 65, 65, 65, 65, 65, 65, 65, 63, 63, 65, 65, 63, 63, 65, 
65, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 
63, 65, 65, 63, 63, 65, 65, 63, 63), Date = structure(c(5123, 
5123, 5123, 5123, 5124, 5124, 5124, 5124, 5125, 5125, 5125, 5125, 
5126, 5126, 5126, 5126, 5127, 5127, 5127, 5127, 5128, 5128, 5177, 
5177, 5177, 5177, 5178, 5178, 5178, 5178, 5179, 5179, 5179, 5179, 
5180, 5180, 5180, 5180, 5181, 5181, 5181, 5181, 5200, 5200, 5200, 
5200, 5201, 5201, 5201, 5201, 5202, 5202, 5202, 5202, 5203, 5203, 
5203, 5203, 5204, 5204, 5204, 5204, 5205, 5205, 5205, 5205, 5206, 
5206, 5206, 5206, 5238, 5238, 5239, 5239, 5240, 5240, 5273, 5273, 
5273, 5273, 5274, 5274, 5274, 5274, 5319, 5319, 5320, 5325, 5326, 
5326, 5327, 5327, 5327, 5327, 5328, 5328, 5328, 5328, 5329, 5329, 
5329, 5329, 5330, 5330, 5330, 5330, 5331, 5331, 5344, 5344, 5345, 
5345, 5381, 5381, 5382, 5382, 5383, 5383, 5383, 5383, 5384, 5384, 
5384, 5384, 5401, 5401, 5402, 5402, 5402, 5402, 5403, 5403, 5403, 
5403, 5404, 5404, 5404, 5404, 5405, 5405, 5405, 5405, 5406, 5406, 
5406, 5406, 5407, 5407, 5407, 5407), class = "Date")), row.names = c(NA, 
-150L), class = c("tbl_df", "tbl", "data.frame"))

Session Info

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.5     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3    
[7] tibble_3.1.0    ggplot2_3.3.3   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       cellranger_1.1.0 pillar_1.5.1     compiler_4.0.4   dbplyr_2.1.0     tools_4.0.4     
 [7] jsonlite_1.7.2   lubridate_1.7.10 lifecycle_1.0.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.10    
[13] reprex_1.0.0     cli_2.3.1        rstudioapi_0.13  DBI_1.1.1        haven_2.3.1      withr_2.4.1     
[19] xml2_1.3.2       httr_1.4.2       fs_1.5.0         generics_0.1.0   vctrs_0.3.6      hms_1.0.0       
[25] grid_4.0.4       tidyselect_1.1.0 glue_1.4.2       R6_2.5.0         fansi_0.4.2      readxl_1.3.1    
[31] modelr_0.1.8     magrittr_2.0.1   backports_1.2.1  scales_1.1.1     ellipsis_0.3.1   rvest_1.0.0     
[37] assertthat_0.2.1 colorspace_2.0-0 utf8_1.2.1       stringi_1.5.3    munsell_0.5.0    broom_0.7.5     
[43] crayon_1.4.1 

Yes, this is one of the recent changes in dplyr, and it affects a nested group_by. An issue was opened about it earlier, but it was closed, and it does not look like this behaviour is going to change.

The solution is to create the new column with mutate and then use it in group_by.

library(dplyr)

dtt%>%
  group_by(x, y) %>%
  mutate(grp = cumsum(c(TRUE, diff(Date) != 1))) %>%
  group_by(grp, .add = TRUE)

# A tibble: 150 x 4
# Groups:   x, y, grp [35]
#       x     y Date         grp
#   <dbl> <dbl> <date>     <int>
# 1  -121    65 1984-01-11     1
# 2  -120    65 1984-01-11     1
# 3  -121    63 1984-01-11     1
# 4  -120    63 1984-01-11     1
# 5  -121    65 1984-01-12     1
# 6  -120    65 1984-01-12     1
# 7  -121    63 1984-01-12     1
# 8  -120    63 1984-01-12     1
# 9  -121    65 1984-01-13     1
#10  -120    65 1984-01-13     1
# … with 140 more rows
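
As far as I can tell from the dplyr changelog, from version 1.0.3 onward group_by() evaluates an inline expression such as grp = cumsum(...) on the ungrouped data, which would explain why nearly every row ends up in its own group. The sketch below compares the run counter computed per group and over the whole column. The tibble toy, its column key, and the helper run_id are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data: two keys with interleaved rows and a gap in the dates
toy <- tibble(
  key  = rep(c("a", "b"), times = 4),
  Date = as.Date(rep(c(1, 2, 3, 10), each = 2), origin = "1970-01-01")
)

# Illustrative helper: start a new run whenever the step to the next date is not 1 day
run_id <- function(d) cumsum(c(TRUE, diff(d) != 1))

# Computed within each key: consecutive days share one run id
grouped <- toy %>%
  group_by(key) %>%
  mutate(grp = run_id(Date)) %>%
  ungroup()

# Computed over the whole column: repeated dates break the runs almost everywhere
ungrouped <- toy %>%
  mutate(grp = run_id(Date))

grouped$grp
ungrouped$grp
```

Within each key the dates step by one day until the gap, so the grouped version yields two runs per key. Over the whole column the interleaved rows repeat each date, so diff(Date) is 0 on every other row and the counter keeps increasing.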

Does this answer your question?

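dtt %>%
  group_by(x, y) %>%
  # number each run of consecutive dates within an (x, y) pair
  mutate(grp = cumsum(c(TRUE, diff(Date) != 1))) %>%
  # .add is an argument of group_by(), not mutate(); assuming each run should be a grouping level
  group_by(grp, .add = TRUE) %>%
  # label runs of at least 5 rows with their group id, all others with NA
  mutate(event = if (n() >= 5) cur_group_id() else NA_integer_)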
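
To check which runs are long enough to count as an event, one option is to summarise each run into its start date, end date, and length. This is only a sketch on a made-up tibble toy; the names runs, start, end, and n_days are illustrative and do not come from the original posts.

```r
library(dplyr)

# Hypothetical toy data: two x values with interleaved rows and a gap in the dates
toy <- tibble(
  x    = rep(c(-121, -120), times = 4),
  Date = as.Date(rep(c(1, 2, 3, 10), each = 2), origin = "1970-01-01")
)

runs <- toy %>%
  group_by(x) %>%
  # number runs of consecutive dates within each x
  mutate(grp = cumsum(c(TRUE, diff(Date) != 1))) %>%
  # add the run number as a second grouping level
  group_by(grp, .add = TRUE) %>%
  # one row per run: first date, last date, number of rows
  summarise(start = min(Date), end = max(Date), n_days = n(), .groups = "drop")

runs
```

Filtering runs on n_days >= 5 would then list the same runs that the event column is meant to label.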
