dplyr group_by summarise inconsistent number of rows
I have been following the tutorial on DataCamp. I have the following line of code which, each time I run it, produces a different value for "drows":
hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(rows = n(), drows = n_distinct(rows))
First time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]

        UniqueCarrier  Dest  rows drows
                <chr> <chr> <int> <int>
 1            AirTran   ATL   211    86
 2            AirTran   BKG    14     6
 3             Alaska   SEA    32    18
 4           American   DFW   186    74
 5           American   MIA   129    57
 6     American_Eagle   DFW   234   101
 7     American_Eagle   LAX    74    34
 8     American_Eagle   ORD   133    56
 9 Atlantic_Southeast   ATL    64    28
10 Atlantic_Southeast   CVG     1     1
# ... with 224 more rows
Second time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]

        UniqueCarrier  Dest  rows drows
                <chr> <chr> <int> <int>
 1            AirTran   ATL   211   125
 2            AirTran   BKG    14    13
 3             Alaska   SEA    32    29
 4           American   DFW   186   118
 5           American   MIA   129    76
 6     American_Eagle   DFW   234   143
 7     American_Eagle   LAX    74    47
 8     American_Eagle   ORD   133    85
 9 Atlantic_Southeast   ATL    64    44
10 Atlantic_Southeast   CVG     1     1
# ... with 224 more rows
Third time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]

        UniqueCarrier  Dest  rows drows
                <chr> <chr> <int> <int>
 1            AirTran   ATL   211    88
 2            AirTran   BKG    14     7
 3             Alaska   SEA    32    16
 4           American   DFW   186    79
 5           American   MIA   129    61
 6     American_Eagle   DFW   234    95
 7     American_Eagle   LAX    74    31
 8     American_Eagle   ORD   133    67
 9 Atlantic_Southeast   ATL    64    31
10 Atlantic_Southeast   CVG     1     1
# ... with 224 more rows
My question is: why does this value change on every run? What is it actually computing?
Apparently this is normal behaviour; see this issue: https://github.com/tidyverse/dplyr/issues/2222
This is because values in list columns are compared by reference, so n_distinct() treats them as different unless they really point to the same object.
So the internal storage of the data frame changes how n_distinct() behaves. Hadley's comment in that issue seems to say it might be a bug (in the sense of unwanted behaviour), or it might be expected behaviour that needs to be documented better.
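If the goal was to count distinct values of an existing column within each group, rather than distinct values of the freshly computed count, a minimal sketch could look like the one below. It assumes a reasonably recent dplyr (1.0.0 or later, for the .groups argument). The toy data frame and the column name planes are made up for illustration, and TailNum stands in for whichever column you actually want to count.

```r
library(dplyr)

# Hypothetical toy data standing in for hflights (values are invented)
flights <- data.frame(
  UniqueCarrier = c("AirTran", "AirTran", "AirTran", "Alaska", "Alaska"),
  Dest          = c("ATL", "ATL", "ATL", "SEA", "SEA"),
  TailNum       = c("N1", "N1", "N2", "N3", "N3"),
  stringsAsFactors = FALSE
)

result <- flights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(
    rows   = n(),                  # number of rows in each group
    planes = n_distinct(TailNum),  # distinct values of an atomic column
    .groups = "drop"               # return an ungrouped result
  )

print(result)
```

Because n_distinct() is applied here to an ordinary character column, the result should be the same on every run.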