Proper/fastest way to reshape a data.table

Question

I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

Answer 1

The data.table package implements faster melt/dcast functions (in C). It also adds features, such as melting and casting multiple columns at once. Please see the new vignette "Efficient reshaping using data.tables" on GitHub.

melt/dcast functions for data.table have been available since v1.9.0. The features include:

There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.
dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().
melt:
- is capable of melting on columns of type 'list'.
- gains variable.factor and value.factor, which default to TRUE and FALSE respectively for compatibility with reshape2. This allows direct control over the output type of the variable and value columns (as factors or not).
- melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.
- NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.
dcast:
- accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.
- the rowid() function can be used directly in the formula to generate an id column, which is sometimes required to identify the rows uniquely. See ?dcast.
Old benchmarks:
- melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
- dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
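Applied to the question, the aggregation and the reshape can be written as a single dcast() call. The following is a minimal sketch, assuming a reasonably recent data.table; the values in v are illustrative and are not the sampled values from the question.

```r
library(data.table)

# Same shape as the question's table, with illustrative values for v.
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = 1:12)

# Aggregate and reshape in one call:
# one row per value of x, one column per value of y, cells hold sum(v).
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
print(wide)
```

The formula x ~ y puts x in the rows and spreads y across the columns, so no intermediate aggregated table is needed.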

Reminder of Cologne (Dec 2013) presentation slide 32: Why not submit a dcast pull request to reshape2?
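The list-valued measure.vars feature mentioned above is easiest to see going the other way, from wide to long. This is a rough sketch with a hypothetical wide table (the column names sum_A, sum_B, mean_A, mean_B are made up for illustration); it assumes a data.table version that provides patterns().

```r
library(data.table)

# Hypothetical wide table: a sum and a mean column for each of groups A and B.
wide <- data.table(x = 1:3,
                   sum_A  = c(72, 84, 162), sum_B  = c(123, 119, 96),
                   mean_A = c(36, 42, 81),  mean_B = c(61.5, 59.5, 48))

# Each pattern selects one set of columns; each set becomes one value column.
long <- melt(wide, id.vars = "x",
             measure.vars = patterns("^sum_", "^mean_"),
             value.name = c("sum", "mean"),
             variable.name = "group")
print(long)
```

Note that the group column holds the position of the matched column within each pattern (first match, second match), not the letters A and B, so it may need relabelling afterwards.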

Answer 2

This feature is now implemented in data.table (from version 1.8.11 on), as can be seen in Zach's answer above.

I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6), 
                  y=c("A","B"), 
                  v=sample(1:100,12))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
   x        A        B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650
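The trick in the last line is that a named list returned from j becomes one column per name. Here is a small sketch of the same idea on the aggregated table from the question. It uses setNames() rather than setattr(), which is a swap on my part: setNames() makes a copy (the cost the edit comment above is avoiding) but does not touch the per-group vector by reference.

```r
library(data.table)

# The aggregated table from the question, typed in by hand.
out <- data.table(x   = rep(1:3, each = 2),
                  y   = rep(c("A", "B"), times = 3),
                  SUM = c(72L, 123L, 84L, 119L, 162L, 96L))

# Within each x group, SUM has one element per y value.
# Naming the elements by y and returning a list yields one column per y value.
wide <- out[, setNames(as.list(SUM), y), by = x]
print(wide)
```

This relies on every group having the same y values in the same order, which is why the unbalanced case needs extra care.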

This gives the same results as DWin's approach:

tapply(DT$v,list(DT$x, DT$y), FUN=sum)
         A        B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650

Also, it is fast:

system.time({ 
   out <- DT[,list(SUM=sum(v)),by=list(x,y)]
   out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
##  user  system elapsed 
## 0.64    0.05    0.70 
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user  system elapsed 
## 7.23    0.16    7.39

UPDATE

So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:

library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)

intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]

out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
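For comparison, dcast() fills in missing combinations itself, so the explicit expand.grid step is not needed there. A minimal sketch with made-up values, assuming data.table's documented fill argument:

```r
library(data.table)

# Unbalanced data: x = 4 has a "B" row but no "A" row.
DT <- data.table(x = c(1, 1, 2, 2, 3, 3, 4),
                 y = c("A", "B", "A", "B", "A", "B", "B"),
                 v = c(10L, 20L, 30L, 40L, 50L, 60L, 70L))

# fill sets the value used for combinations that do not occur in the data.
wide <- dcast(DT, x ~ y, value.var = "v",
              fun.aggregate = sum, fill = NA_integer_)
print(wide)
```

Without fill, ?dcast describes the missing cells as being filled by calling fun.aggregate on a zero-length vector, which for sum would be 0 rather than NA.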

Summary

Combining the comments with the above, here's the 1-line solution:

DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = TRUE][,
   setNames(as.list(V1), paste(y)), by = x]

It's also easy to modify this to have more than just the sum, e.g.:

DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = TRUE][,
   setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
#   x A.sum B.sum   A.mean B.mean
#1: 1    72   123 36.00000   61.5
#2: 2    84   119 42.00000   59.5
#3: 3   187    96 62.33333   48.0
#4: 4    NA    81       NA   81.0
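The same several-statistics result can also be sketched with dcast(), which accepts a list of aggregation functions. The values below are illustrative, and the exact names of the generated columns are left to data.table, so the sketch prints them rather than assuming them.

```r
library(data.table)

# Balanced illustrative data (not the sampled values used above).
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = 1:12)

# One output column per (function, y value) pair.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))
print(names(wide))
print(wide)
```

With two functions and two y values this gives four value columns next to x.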

Answer 3

data.table objects inherit from 'data.frame', so you can just use tapply:

> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
   AA  BB
a  72 123
b  84 119
c 162  96
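One thing to keep in mind is that tapply returns a matrix, not a data.table, with the x values stored as row names. If a data.table is needed afterwards, a conversion along these lines should work (a sketch with illustrative values):

```r
library(data.table)

DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = 1:12)

# tapply gives a matrix: rows are the x values, columns are the y values.
m <- tapply(DT$v, list(DT$x, DT$y), FUN = sum)

# Convert back to a data.table; the row names arrive in a column called "rn".
wide <- as.data.table(m, keep.rownames = TRUE)
setnames(wide, "rn", "x")
print(wide)
```

The x column comes back as character, because matrix row names are always strings.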

Answer 4

You can use dcast from the reshape2 library. Here is the code:

# DUMMY DATA
library(data.table)
mydf = data.table(
  x = rep(1:3, each = 4),
  y = rep(c('A', 'B'), times = 2),
  v = rpois(12, 30)
)

# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun.aggregate = sum, value.var = "v")

NOTE: The tapply solution would be much faster.
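Speed claims like this depend heavily on package versions and data size, so it is worth timing on your own data. Below is a rough sketch that times tapply against data.table's dcast (not reshape2's, which is an assumption made here so the example needs only one package) and checks that both give the same sums. No timings are shown because they vary by machine.

```r
library(data.table)

set.seed(1)
n <- 1e5  # illustrative size; increase for a more realistic comparison
big <- data.table(x = sample(1:3, n, replace = TRUE),
                  y = sample(c("A", "B"), n, replace = TRUE),
                  v = sample(1:100, n, replace = TRUE))

# Time both approaches on the same data.
t_tapply <- system.time(m <- tapply(big$v, list(big$x, big$y), FUN = sum))
t_dcast  <- system.time(w <- dcast(big, x ~ y, value.var = "v",
                                   fun.aggregate = sum))
print(t_tapply)
print(t_dcast)

# Both approaches should agree on the sums.
stopifnot(all.equal(as.numeric(m[, "A"]), as.numeric(w$A)))
stopifnot(all.equal(as.numeric(m[, "B"]), as.numeric(w$B)))
```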

4 answers

Answer 1: 73, accepted, 2011-08-02 13:52:14
Answer 2: 32, 2013-03-19 23:25:57
Answer 3: 21, 2011-08-01 17:31:43
Answer 4: 7, 2011-08-01 17:35:09
