简体   繁体   English

如何取消组合data.table中的列列?

[英]How to ungroup list columns in data.table?

tidyr provides the unnest function that help expanding list columns. tidyr提供UNNEST功能,帮助扩大列表列。

This is similar to the much (20x) faster ungroup function in kdb. 这类似于kdb中更多(20x)更快的ungroup函数。

I am looking for a similar (but much faster) function that, assuming a data.table that contains several list columns, each with the same number of element on each row, would expand the data.table. 我正在寻找一个类似(但速度更快)的函数,假设data.table包含多个列表列,每个列在每行上具有相同数量的元素,将扩展data.table。

This an extension of this post . 这是这篇文章的延伸。

library(data.table)
library(tidyr)
t = Sys.time()
DT = data.table(a=c(1,2,3),
                b=c('q','w','e'),
                c=list(rep(t,2),rep(t+1,3),rep(t,0)),
                d=list(rep(1,2),rep(20,3),rep(1,0)))

print(DT)
   a b                                                           c        d
1: 1 q                     2016-01-09 09:55:14,2016-01-09 09:55:14      1,1
2: 2 w 2016-01-09 09:55:15,2016-01-09 09:55:15,2016-01-09 09:55:15 20,20,20
3: 3 e                                                                     

print(unnest(DT))
Source: local data frame [5 x 4]

      a     b                   c     d
  (dbl) (chr)              (time) (dbl)
1     1     q 2016-01-09 09:55:14     1
2     1     q 2016-01-09 09:55:14     1
3     2     w 2016-01-09 09:55:15    20
4     2     w 2016-01-09 09:55:15    20
5     2     w 2016-01-09 09:55:15    20

Here is my own attempt... that seems to be 2x quicker but should be largely improved... 这是我自己的尝试......似乎要快2倍,但应该大大改善......

dtUngroup <- function(DT){
  colClasses <- lapply(DT,FUN=class)
  listCols <- colnames(DT)[colClasses=='list']
  if(length(listCols)>0){
    nonListCols <- setdiff(colnames(DT),listCols)
    nbListElem <- unlist(DT[,lapply(.SD,FUN=lengths),.SDcols=(listCols[1L])])
    DT1 <- DT[,lapply(.SD,FUN=rep,times=(nbListElem)),.SDcols=(nonListCols)]
    DT1[,(listCols):=DT[,lapply(.SD,FUN=function(x) do.call('c',x)),.SDcols=(listCols)]]
    return(DT1)
  }
  return(DT)
} 
dtUngroup(DT)[]
   a b                   c  d
1: 1 q 2016-01-09 09:55:14  1
2: 1 q 2016-01-09 09:55:14  1
3: 2 w 2016-01-09 09:55:15 20
4: 2 w 2016-01-09 09:55:15 20
5: 2 w 2016-01-09 09:55:15 20

Using: 使用:

na.omit(DT[, lapply(.SD, unlist), a][, c := as.POSIXct(c, origin="1970-01-01")])

gives: 得到:

   a b                   c  d
1: 1 q 2016-01-09 12:17:24  1
2: 1 q 2016-01-09 12:17:24  1
3: 2 w 2016-01-09 12:17:25 20
4: 2 w 2016-01-09 12:17:25 20
5: 2 w 2016-01-09 12:17:25 20

When the values in the a column are not unique for each row, you can use: 如果a列中的值对于每一行都不是唯一的,则可以使用:

na.omit(DT[, lapply(.SD, unlist), by=1:nrow(DT)][, c := as.POSIXct(c, origin="1970-01-01")])

A benchmarck: 一个基准测试:

> microbenchmark(dtUngroup(DT)[], jaap())
Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval cld
 dtUngroup(DT)[] 3.935677 4.005596 4.189208 4.066196 4.227372 6.750338   100   b
          jaap() 1.977175 2.039830 2.094536 2.074314 2.132525 2.309848   100  a 

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM