Proper/fastest way to reshape a data.table
I have a data table in R:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
x y v
[1,] 1 A 12
[2,] 1 B 62
[3,] 1 A 60
[4,] 1 B 61
[5,] 2 A 83
[6,] 2 B 97
[7,] 2 A 1
[8,] 2 B 22
[9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49
I can easily sum the variable v by the groups in the data.table:
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
x y SUM
[1,] 1 A 72
[2,] 1 B 123
[3,] 2 A 84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B 96
However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:
out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
x SUM.A SUM.B
[1,] 1 72 123
[2,] 2 84 119
[3,] 3 162 96
Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?
The data.table package implements faster melt/dcast functions (in C). It also adds features, such as melting and casting multiple columns at once. Please see the new "Efficient reshaping using data.tables" vignette on GitHub.
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
- There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.
- dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().
melt:
- is capable of melting on columns of type 'list'.
- gains variable.factor and value.factor, which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of the variable and value columns (as factors or not).
- melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.
- NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.
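As a rough illustration of the measure.vars list feature, here is a minimal sketch. The wide table and its column names (family, dob_1, sex_1 and so on) are made up for this example and are not from the question; only melt(), patterns() and their arguments come from data.table.

```r
library(data.table)

# Hypothetical wide table: two attributes (dob, sex) recorded for two children
wide <- data.table(
  family = 1:2,
  dob_1  = c("1998-11-26", "1996-06-22"),
  dob_2  = c("2000-01-29", "2002-07-11"),
  sex_1  = c("M", "F"),
  sex_2  = c("F", "M")
)

# Each pattern selects one group of columns; each group becomes one value column
long <- melt(wide,
             id.vars       = "family",
             measure.vars  = patterns("^dob_", "^sex_"),
             variable.name = "child",
             value.name    = c("dob", "sex"))
long
```

The result should have one row per family and child, with dob and sex side by side rather than stacked in a single value column.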
dcast:
- accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.
- use the rowid() function directly in the formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.
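Applied to the question, dcast can aggregate and reshape in a single call. The following is a sketch that rebuilds the question's DT; the sampled values depend on the R version, so no output is shown.

```r
library(data.table)

set.seed(1234)
DT <- data.table(x = rep(c(1, 2, 3), each = 4), y = c("A", "B"), v = sample(1:100, 12))

# One row per x, one column per level of y, cells filled with sum(v)
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
wide

# Several aggregates in one call by passing a list of functions
wide2 <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))
wide2
```

Here x ~ y puts x on the rows and spreads y into columns, so the separate by= aggregation step followed by reshape() is not needed.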
Old benchmarks:
- melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
- dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
Reminder of Cologne (Dec 2013) presentation slide 32: Why not submit a dcast pull request to reshape2?
I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6),
y=c("A","B"),
v=sample(1:100,12))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
x A B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650
This gives the same results as DWin's approach:
tapply(DT$v,list(DT$x, DT$y), FUN=sum)
A B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650
Also, it is fast:
system.time({
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
## user system elapsed
## 0.64 0.05 0.70
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user system elapsed
## 7.23 0.16 7.39
UPDATE
For this solution to also work on unbalanced data sets (i.e. where some combinations of x and y do not exist), you have to add the missing combinations to the data table first:
library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)
intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
Summary
Combining the comments with the above, here's the 1-line solution:
DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(V1), paste(y)), by = x]
It's also easy to modify this to have more than just the sum, e.g.:
DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
# x A.sum B.sum A.mean B.mean
#1: 1 72 123 36.00000 61.5
#2: 2 84 119 42.00000 59.5
#3: 3 187 96 62.33333 48.0
#4: 4 NA 81 NA 81.0
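For comparison, the same unbalanced case can also be sketched with dcast. One detail worth stating as an assumption to check against ?dcast: when fun.aggregate is supplied and fill is left at its default, missing cells are filled by calling the aggregate on a zero-length vector, so sum would give 0 rather than NA. Passing fill = NA keeps missing combinations visible.

```r
library(data.table)

set.seed(1234)
# Unbalanced data from the UPDATE above: x = 4 only occurs with y = "B"
DT <- data.table(x = c(rep(c(1, 2, 3), each = 4), 3, 4), y = c("A", "B"), v = sample(1:100, 14))

# fill = NA marks combinations of x and y that never occur in DT
res <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum, fill = NA)
res
```

This avoids building the CJ() grid by hand, at the cost of relying on dcast's fill behaviour.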
Data.table objects inherit from 'data.frame' so you can just use tapply:
> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
AA BB
a 72 123
b 84 119
c 162 96
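Note that tapply returns a plain matrix, not a data.table. If a data.table is needed afterwards, a conversion along these lines should work; this is a sketch using the question's DT, and the column name x is restored by hand.

```r
library(data.table)

set.seed(1234)
DT <- data.table(x = rep(c(1, 2, 3), each = 4), y = c("A", "B"), v = sample(1:100, 12))

# tapply gives a matrix whose row names are the x values and column names the y values
m <- tapply(DT$v, list(DT$x, DT$y), FUN = sum)

# keep.rownames = TRUE moves the row names into a character column called "rn"
res <- as.data.table(m, keep.rownames = TRUE)
setnames(res, "rn", "x")
res
```

The x column comes back as character because it is rebuilt from the matrix row names.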
You can use dcast from the reshape2 library. Here is the code:
# DUMMY DATA
library(data.table)
mydf = data.table(
x = rep(1:3, each = 4),
y = rep(c('A', 'B'), times = 2),
v = rpois(12, 30)
)
# USE RESHAPE2
library(reshape2)
# the arguments are spelled fun.aggregate and value.var (dots, not underscores)
dcast(mydf, x ~ y, fun.aggregate = sum, value.var = "v")
NOTE: The tapply solution would be much faster.