
r data.table: inconsistency when aggregating the grouping column

I'm using the data.table package to aggregate a column that is also the grouping column, but the results are not what I expected.

library(data.table)
my_data <- data.table(contnt = c("america", "asia", "asia", "europe", "europe", "europe"), num = 1:6)

#my_data
#contnt  num
#america  1
#asia     2
#asia     3
#europe   4
#europe   5
#europe   6

my_data[, length(contnt),by=contnt]
#contnt  V1
#america  1
#asia     1
#europe   1

It works differently when I aggregate a column other than the grouping column:

my_data[, length(num),by=contnt]
#contnt  V1
#america  1
#asia     2
#europe   3

What causes this discrepancy?

This is a great example of how data.table passes grouping variables, as opposed to other variables, to functions in j:

my_data[,print(contnt),by=contnt]
# [1] "america"
# [1] "asia"
# [1] "europe"

my_data[,print(num),by=contnt]
# [1] 1
# [1] 2 3
# [1] 4 5 6

Essentially, a grouping variable is passed as a vector of length 1 for each group, whereas for any other variable the entire vector of that group's values is passed. That is why length(contnt) returns 1 for every group while length(num) returns the group size.
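To make the difference concrete, here is a minimal sketch that computes both lengths side by side, reusing the question's data. The result column names (len_group_col, len_other_col, len_expanded) are made up for illustration, and expanding the grouping value with rep(contnt, .N) is just one possible way to get a full-length vector back, not something the FAQ prescribes.

```r
library(data.table)

my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num    = 1:6
)

# Inside j, the grouping column is a length-1 vector,
# while a non-grouping column holds all of the group's values.
lens <- my_data[, .(
  len_group_col = length(contnt),          # length of the grouping column
  len_other_col = length(num),             # length of a regular column
  len_expanded  = length(rep(contnt, .N))  # grouping value repeated to group size
), by = contnt]

print(lens)
```

In this sketch len_group_col should be 1 for every group, while len_other_col and len_expanded should both match the number of rows in each group.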

Please study the data.table FAQ:

Inside each group, why are the group variables length-1?

[...] x is a grouping variable and (as from v1.6.1) has length 1 (if inspected or used in j). It's for efficiency and convenience. [...]

If you need the size of the current group, use .N rather than calling length() on any column.
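As a short sketch of that advice applied to the question's data (the column name n_rows is only an illustrative choice):

```r
library(data.table)

my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num    = 1:6
)

# .N is the number of rows in the current group;
# without a name, the result column is called N.
counts <- my_data[, .N, by = contnt]
print(counts)

# The same count with an explicit column name.
counts_named <- my_data[, .(n_rows = .N), by = contnt]
print(counts_named)
```

Because .N does not depend on any particular column, it gives the group size regardless of whether the column you had in mind is a grouping column.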
