
Aggregation with data.table in R

The exercise consists of aggregating a numeric vector of values by a combination of factors with data.table in R. Take the following data table as an example:

require (data.table)
require (plyr)
dtb <- data.table (cbind (expand.grid (month = rep (month.abb[1:3], each = 3),
                                       fac = letters[1:3]),
                          value = rnorm (27)))

Notice that every unique combination of 'month' and 'fac' shows up three times. So, when I average the values by both of these factors, I expect a data frame with 9 unique rows:

(agg1 <- ddply (dtb, c ("month", "fac"), function (dfr) mean (dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114

However, when aggregating with data.table, I keep getting a result row for every redundant combination of the two factors:

(agg2 <- dtb[, value := mean (value), by = list (month, fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value

Is there an elegant way to collapse these results to one row per unique combination of factors with data.table?

The issue (and the reasoning behind it) is that the aggregated value is being assigned, not just calculated.

This is easier to observe in action if you look at a data.table with more columns than just the ones used in the computation.

# Therefore, let's add a new column
# (LETTERS has only 26 entries, so row 27 gets NA)
dtb[, newCol := LETTERS[seq(length(value))]]

Note that if we just want to output the computed value, then the expression on the RHS as you have it is just fine.

# This gives the expected results
dtb[, mean (value), by = list (month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean (value), by = list (month, fac)]

In other words, the first form subsets the data and returns only one row per unique combination.
However, if you want to save this value back into the SAME data table (which is what happens when using the := operator), then all rows identified in i (all rows by default) will be assigned a value (which makes sense when you look at the output with the additional columns).
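The difference between computing and assigning can be seen by counting rows. This is a minimal sketch; the seed and the names `computed` and `assigned` are assumptions made for illustration, and `copy()` is used only so that the original table is not altered by the comparison.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the sketch is reproducible
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# Computing only: j returns a new table, one row per group (column named V1)
computed <- dtb[, mean(value), by = list(month, fac)]

# Assigning: := writes the group mean into every row of the group.
# copy() keeps the original dtb untouched for the comparison.
assigned <- copy(dtb)[, value := mean(value), by = list(month, fac)]

nrow(computed)  # number of groups
nrow(assigned)  # number of rows in the original table
```

The first count is the number of month/fac groups, the second is the full row count of the table.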

Copying this data.table to agg2 then still carries over all the rows.
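Because := modifies the table in place, the original table is changed as well as whatever the result is assigned to. The sketch below illustrates this under an assumed seed; `dtb.bak` follows the backup naming used later in this answer.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the sketch is reproducible
dtb.bak <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                                  fac = letters[1:3]),
                      value = rnorm(27))
dtb <- copy(dtb.bak)  # independent copy, so dtb.bak keeps the raw values

agg2 <- dtb[, value := mean(value), by = list(month, fac)]

# agg2 and dtb now hold the same group means in every row
identical(agg2$value, dtb$value)

# dtb itself was overwritten: it no longer matches the backup
isTRUE(all.equal(dtb$value, dtb.bak$value))

nrow(agg2)  # still the full row count, nothing was collapsed
```

If the raw values are still needed afterwards, taking a `copy()` before assigning, as done here, is one way to keep them.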

Therefore, if you want to copy to a new table only those rows from your original table that are unique, you can

a.  wrap the original table inside `unique()` before assigning it
b.  assign the table, shown above, that is returned when you
    are not assigning the RHS output with := (which is what @Arun suggested)

An example of a. would be:

 agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)])

The following example might help illustrate.

(You will need to copy and paste this to run it, as the output is omitted.)

  # SAMPLE DATA, as above
  library(data.table)
  dtb.bak <- data.table (expand.grid (month = rep (month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm (27))

  #  METHOD 1  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.


  dtb[, value := mean (value), by = list (month, fac)]
  dtb

  # this is what you would like to assign
  unique(dtb)


  #  METHOD 2  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.

  # this is what you would like to assign
  # next two lines are the same, only difference is column name
  dtb[, mean (value), by = list (month, fac)]
  dtb[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity

  # dtb is unchanged. 
  dtb



  # NOW COMPARE THE SAME TWO METHODS WHEN THERE IS AN ADDITIONAL COLUMN
  dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]


  dtb1 <- copy(dtb.bak)  # restore, from sample data.
  dtb2 <- copy(dtb.bak)  # restore, from sample data.


  # Method 1
  dtb1[, value := mean (value), by = list (month, fac)]
  dtb1
  unique(dtb1)

  #  METHOD 2  # 
  dtb2[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity
  dtb2

  # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
  dtb2[, list("mean" = mean (value), newCol), by = list (month, fac)]  # quote marks added for clarity
  # notice this has more rows than
  unique(dtb1)
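For the extra-column case above, counting rows shows why a plain `unique()` no longer collapses to one row per group. This is a sketch under an assumed seed; the `by` argument of `unique()` for data.tables is used here as one possible way to deduplicate on the grouping columns only, and it keeps the first row of each group.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the sketch is reproducible
dtb1 <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                               fac = letters[1:3]),
                   value = rnorm(27))
dtb1[, newCol := rep(c("A", "B", "A"), .N / 3)]
dtb1[, value := mean(value), by = list(month, fac)]

# Rows are compared across all columns, so newCol keeps some duplicates apart
nrow(unique(dtb1))

# Deduplicate on the grouping columns only: one row per month/fac pair
nrow(unique(dtb1, by = c("month", "fac")))
```

With this sample data, each group contains both "A" and "B" in newCol, so the first count is 18 and the second is 9.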

You should do:

agg2 <- dtb[, list(value = mean(value)), by = list (month, fac)]

:= will recycle the RHS values to fit the number of elements on the LHS. Run ?':=' to read more about this.
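A small hand-checkable sketch of that recycling, using a made-up three-row table (the names `dt`, `g`, `v` and `m` are assumptions for illustration only):

```r
library(data.table)

# Hypothetical toy data: group "x" has two rows, group "y" has one
dt <- data.table(g = c("x", "x", "y"), v = c(1, 3, 10))

# The length-1 group mean is recycled across every row of its group
dt[, m := mean(v), by = g]
dt$m

# Without :=, the same mean is returned once per group
agg <- dt[, list(m = mean(v)), by = g]
agg
```

Here `dt` keeps its three rows, with the mean of group "x" repeated twice, while `agg` has one row per group.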
