Proper/fastest way to reshape a data.table

I have a data.table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v within the groups of the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns rather than rows. I can accomplish this using reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

The data.table package implements faster melt/dcast functions (in C). They can also melt and cast multiple columns at once. Please see the vignette "Efficient reshaping using data.tables" on GitHub.

melt/dcast functions for data.table have been available since v1.9.0, and the features include:

  • There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.

  • dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().

  • melt:

    • is capable of melting on columns of type 'list'.

    • gains variable.factor and value.factor, which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of the variable and value columns (as factors or not).

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.

    • the rowid() function can be used directly in the formula to generate an id column, which is sometimes required to identify the rows uniquely. See ?dcast.

  • Old benchmarks:

    • melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
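
Applied to the question, the aggregation and the reshape can be done in a single dcast call. The following is a minimal sketch, not a benchmarked solution. It assumes a data.table version in which dcast() is available as a generic, and it hard-codes the v values from the table printed in the question instead of calling sample(), so the sums should match the SUM.A and SUM.B columns shown there.

```r
library(data.table)

# Values copied from the table printed in the question, so the result
# does not depend on the random number generator.
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49))

# Aggregate and reshape in one call: one row per x, one column per value of y.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
print(wide)

# Several aggregates at once: pass a list of functions.
wide2 <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))
print(wide2)
```

The second call returns one column per combination of aggregate function and level of y, next to the x column.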

Reminder of the Cologne (Dec 2013) presentation, slide 32: Why not submit a dcast pull request to reshape2?

This feature is now implemented in data.table (from version 1.8.11 on), as can be seen in Zach's answer above.

I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6), 
                  y=c("A","B"), 
                  v=sample(1:100,12))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
   x        A        B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650
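
To see why this works, it helps to look at what j returns for a single group: a named list, which data.table turns into one column per name. The sketch below uses the small aggregated table from the question (values copied from its printed output). It uses setNames() instead of setattr(), which is an assumption made for simplicity: it makes a copy, but it produces the same named list.

```r
library(data.table)

# Aggregated table from the question (values taken from its printed output).
out <- data.table(x = rep(c(1, 2, 3), each = 2),
                  y = rep(c("A", "B"), times = 3),
                  SUM = c(72, 123, 84, 119, 162, 96))

# What j evaluates to for a single group (x == 1):
# a named list, so the result has one column per name.
one_group <- out[x == 1, setNames(as.list(SUM), y)]
print(one_group)

# The same expression evaluated once per x gives the wide table.
wide <- out[, setNames(as.list(SUM), y), by = x]
print(wide)
```

Note that the column names are taken from the first group, so this pattern relies on every group having the same y values in the same order.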

This gives the same results as DWin's approach:

tapply(DT$v,list(DT$x, DT$y), FUN=sum)
         A        B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650

Also, it is fast:

system.time({ 
   out <- DT[,list(SUM=sum(v)),by=list(x,y)]
   out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
##  user  system elapsed 
## 0.64    0.05    0.70 
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user  system elapsed 
## 7.23    0.16    7.39 

UPDATE

So that this solution also works for unbalanced data sets (i.e. some combinations do not exist), you have to add the missing combinations to the data table first:

library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)

intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]

out[, as.list(setattr(SUM, 'names', y)), by=list(x)]

Summary

Combining the comments with the above, here's the one-line solution:

DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = TRUE][,
   setNames(as.list(V1), paste(y)), by = x]

It's also easy to modify this to compute more than just the sum, e.g.:

DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = TRUE][,
   setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
#   x A.sum B.sum   A.mean B.mean
#1: 1    72   123 36.00000   61.5
#2: 2    84   119 42.00000   59.5
#3: 3   187    96 62.33333   48.0
#4: 4    NA    81       NA   81.0
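
For comparison, dcast can handle the missing combination without the CJ join. This is a sketch under stated assumptions: the data below is the question's printed table plus two extra rows, (x = 3, A, 25) and (x = 4, B, 81), which were inferred from the sums in the output above and are not the original random draw. According to ?dcast, when fill is NULL the aggregate function is applied to a zero-length vector to obtain the fill value, so an explicit fill = NA_real_ is used to get NA for the missing cell.

```r
library(data.table)

# Unbalanced example: the 12 values printed in the question plus two extra
# rows (x = 3, A, 25) and (x = 4, B, 81). The extra rows are an assumption,
# chosen to be consistent with the sums shown in the output above.
DT <- data.table(x = c(rep(c(1, 2, 3), each = 4), 3, 4),
                 y = rep(c("A", "B"), times = 7),
                 v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49, 25, 81))

# Default fill: the missing (x = 4, y = "A") cell gets sum(numeric(0)), i.e. 0.
wide_zero <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
print(wide_zero)

# fill = NA_real_ marks the missing combination as NA instead.
wide_na <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum,
                 fill = NA_real_)
print(wide_na)
```

Whether 0 or NA is the better marker depends on whether "no rows" should count as a zero total in later steps.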

data.table objects inherit from 'data.frame', so you can just use tapply:

> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
   AA  BB
a  72 123
b  84 119
c 162  96

You can use dcast from the reshape2 library. Here is the code:

# DUMMY DATA
library(data.table)
mydf = data.table(
  x = rep(1:3, each = 4),
  y = rep(c('A', 'B'), times = 2),
  v = rpois(12, 30)
)

# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun.aggregate = sum, value.var = "v")

NOTE: The tapply solution would be much faster.
