Why does tidyverse group_by behave unexpectedly after an R update?
It used to work fine, but after I updated R the group_by function treats every row as its own group. In the following example dataset, dtt, if I filter the dataset to only one group and run the code, it works as expected. However, if I run the same code for all groups, it does not work as expected.
The working and non-working code is shown here, and the data is below.
# Filter dtt to only one group (x, y) and run the code; it then works as expected, as shown below
library(dplyr)
dtt_xy <- dtt %>%
  filter(x == -121 & y == 65)
dtt_xy
dtt_output <- dtt_xy %>%
  group_by(x, y) %>%
  group_by(grp = cumsum(c(TRUE, diff(Date) != 1)), .add = TRUE)
dtt_output # Expected output
# Now, if I run the same code on the whole dataset, i.e. dtt, it does not work
dtt_output <- dtt %>%
  group_by(x, y) %>%
  group_by(grp = cumsum(c(TRUE, diff(Date) != 1)), .add = TRUE)
dtt_output # Not the expected output; the expectation is 35 groups
Sample Data
dtt<-structure(list(x = c(-121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120,
-121, -120, -121, -121, -121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120,
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120,
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120,
-121, -120, -120, -120, -121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120,
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120,
-121, -120, -121, -120, -121, -120, -121, -120, -121, -120, -121,
-120, -121, -120, -121, -120, -121, -120, -121, -120, -121, -120
), y = c(65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65,
65, 63, 63, 65, 65, 63, 63, 65, 63, 65, 65, 63, 63, 65, 65, 63,
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63,
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63,
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 65, 65, 65, 65, 65,
65, 63, 63, 65, 65, 63, 63, 63, 63, 65, 63, 63, 63, 65, 65, 63,
63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 63, 63, 65,
65, 65, 65, 65, 65, 65, 65, 65, 65, 63, 63, 65, 65, 63, 63, 65,
65, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63, 63, 65, 65, 63,
63, 65, 65, 63, 63, 65, 65, 63, 63), Date = structure(c(5123,
5123, 5123, 5123, 5124, 5124, 5124, 5124, 5125, 5125, 5125, 5125,
5126, 5126, 5126, 5126, 5127, 5127, 5127, 5127, 5128, 5128, 5177,
5177, 5177, 5177, 5178, 5178, 5178, 5178, 5179, 5179, 5179, 5179,
5180, 5180, 5180, 5180, 5181, 5181, 5181, 5181, 5200, 5200, 5200,
5200, 5201, 5201, 5201, 5201, 5202, 5202, 5202, 5202, 5203, 5203,
5203, 5203, 5204, 5204, 5204, 5204, 5205, 5205, 5205, 5205, 5206,
5206, 5206, 5206, 5238, 5238, 5239, 5239, 5240, 5240, 5273, 5273,
5273, 5273, 5274, 5274, 5274, 5274, 5319, 5319, 5320, 5325, 5326,
5326, 5327, 5327, 5327, 5327, 5328, 5328, 5328, 5328, 5329, 5329,
5329, 5329, 5330, 5330, 5330, 5330, 5331, 5331, 5344, 5344, 5345,
5345, 5381, 5381, 5382, 5382, 5383, 5383, 5383, 5383, 5384, 5384,
5384, 5384, 5401, 5401, 5402, 5402, 5402, 5402, 5403, 5403, 5403,
5403, 5404, 5404, 5404, 5404, 5405, 5405, 5405, 5405, 5406, 5406,
5406, 5406, 5407, 5407, 5407, 5407), class = "Date")), row.names = c(NA,
-150L), class = c("tbl_df", "tbl", "data.frame"))
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[7] tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 cellranger_1.1.0 pillar_1.5.1 compiler_4.0.4 dbplyr_2.1.0 tools_4.0.4
[7] jsonlite_1.7.2 lubridate_1.7.10 lifecycle_1.0.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.10
[13] reprex_1.0.0 cli_2.3.1 rstudioapi_0.13 DBI_1.1.1 haven_2.3.1 withr_2.4.1
[19] xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.1.0 vctrs_0.3.6 hms_1.0.0
[25] grid_4.0.4 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.2 readxl_1.3.1
[31] modelr_0.1.8 magrittr_2.0.1 backports_1.2.1 scales_1.1.1 ellipsis_0.3.1 rvest_1.0.0
[37] assertthat_0.2.1 colorspace_2.0-0 utf8_1.2.1 stringi_1.5.3 munsell_0.5.0 broom_0.7.5
[43] crayon_1.4.1
Yes, this is one of the recent changes in dplyr, and it affects nested group_by calls. Expressions supplied to group_by are computed on the ungrouped data frame, so diff(Date) runs across all rows rather than within each (x, y) group. An issue was created earlier for this, but it was closed, and it doesn't seem that this behaviour is going to change.
The solution is to use mutate to create the new column and then use it in group_by.
library(dplyr)
dtt %>%
  group_by(x, y) %>%
  mutate(grp = cumsum(c(TRUE, diff(Date) != 1))) %>%
  group_by(grp, .add = TRUE)
# A tibble: 150 x 4
# Groups: x, y, grp [35]
# x y Date grp
# <dbl> <dbl> <date> <int>
# 1 -121 65 1984-01-11 1
# 2 -120 65 1984-01-11 1
# 3 -121 63 1984-01-11 1
# 4 -120 63 1984-01-11 1
# 5 -121 65 1984-01-12 1
# 6 -120 65 1984-01-12 1
# 7 -121 63 1984-01-12 1
# 8 -120 63 1984-01-12 1
# 9 -121 65 1984-01-13 1
#10 -120 65 1984-01-13 1
# … with 140 more rows
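To make the difference easier to see, here is a minimal sketch on a made-up dataset. The names toy, id, day and run are hypothetical and not part of the question. It assumes dplyr 1.0.0 or later, where the group_by documentation states that computations are done on the ungrouped data frame.

```r
library(dplyr)

# Hypothetical toy data: two ids with interleaved rows,
# each id having two runs of consecutive days (1-2 and 10-11).
toy <- tibble(
  id  = c("a", "b", "a", "b", "a", "b", "a", "b"),
  day = c(1, 1, 2, 2, 10, 10, 11, 11)
)

# Expression inside group_by(): diff(day) is evaluated over all rows,
# ignoring the earlier group_by(id), so the run counter is wrong.
in_group_by <- toy %>%
  group_by(id) %>%
  group_by(run = cumsum(c(TRUE, diff(day) != 1)), .add = TRUE)

# Expression inside mutate(): diff(day) is evaluated within each id,
# and the resulting column is then added as a grouping variable.
via_mutate <- toy %>%
  group_by(id) %>%
  mutate(run = cumsum(c(TRUE, diff(day) != 1))) %>%
  group_by(run, .add = TRUE)

n_groups(in_group_by) # 8 on dplyr >= 1.0.0, i.e. one group per row
n_groups(via_mutate)  # 4, i.e. two runs for each of the two ids
```

Comparing n_groups for the two pipelines is a quick way to check which behaviour an installed dplyr version has.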
Does this answer your question?
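# Compute grp within each (x, y) group; mutate() has no .add argument,
# so passing .add = TRUE would only create a stray column named .add
dtt %>%
  group_by(x, y) %>%
  mutate(grp = cumsum(c(TRUE, diff(Date) != 1))) %>%
  mutate(event = if (n() >= 5) cur_group_id() else NA_integer_)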