R data.table: inconsistent result when aggregating the grouping column
I'm using the data.table package to aggregate a column that is also the grouping column, but the results are not what I expected.
library(data.table)
my_data = data.table(contnt = c("america", "asia", "asia", "europe", "europe", "europe"), num = 1:6)
#my_data
#contnt num
#america 1
#asia 2
#asia 3
#europe 4
#europe 5
#europe 6
my_data[, length(contnt),by=contnt]
#contnt V1
#america 1
#asia 1
#europe 1
It works differently when I aggregate a column other than the grouping column:
my_data[, length(num),by=contnt]
#contnt V1
#america 1
#asia 2
#europe 3
What causes this discrepancy?
This is a great example to demonstrate how data.table passes grouping variables, as opposed to other variables, to functions:
my_data[,print(contnt),by=contnt]
# [1] "america"
# [1] "asia"
# [1] "europe"
my_data[,print(num),by=contnt]
# [1] 1
# [1] 2 3
# [1] 4 5 6
Essentially, a grouping variable is passed as a vector of length 1 for each group, whereas for any other variable the entire vector for that group is passed.
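To see both behaviours side by side, here is a minimal sketch that calls length() on the grouping column and on the other column in the same j expression. The column names len_group_col and len_other_col are made up for illustration; everything else comes from the question.

```r
library(data.table)

my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num = 1:6
)

# Inside j, the grouping column contnt is a length-1 vector for each group,
# while num holds all of that group's values.
lens <- my_data[, .(len_group_col = length(contnt),
                    len_other_col = length(num)),
                by = contnt]
print(lens)
```

The len_group_col column is 1 for every group, while len_other_col follows the actual group sizes (1, 2, 3).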
Please study the data.table FAQ:
Inside each group, why are the group variables length-1?
[...] x is a grouping variable and (as from v1.6.1) has length 1 (if inspected or used in j). It's for efficiency and convenience. [...] If you need the size of the current group, use .N rather than calling length() on any column.
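Following the FAQ's advice, a short sketch of counting rows per group with .N, using the data from the question. The result names counts and summary_dt, and the output columns n and total, are only illustrative.

```r
library(data.table)

my_data <- data.table(
  contnt = c("america", "asia", "asia", "europe", "europe", "europe"),
  num = 1:6
)

# .N is the number of rows in the current group, whichever columns j uses
counts <- my_data[, .N, by = contnt]
print(counts)

# .N can be combined with other per-group aggregates in the same call
summary_dt <- my_data[, .(n = .N, total = sum(num)), by = contnt]
print(summary_dt)
```

Because .N does not depend on any particular column, it gives the group size even when the only column of interest is the grouping column itself.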