
R reshaping melted data.table with list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table to wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2, because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.

The answer from a previous question about working with melted data tables does not apply here because of the list column.

I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.

library(data.table)

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  data=list(list(), list(), list(), list(), list(), list()),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
  )

# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l }, 
  unique(dt$variable),
  list())

# Expression set up
q <- substitute({
  l <- lapply(
    list_template, 
    function(col) .SD[variable==as.character(col)]$value)
  l$data = .SD[1,]$data
  l
}, list(list_template=list_template))

# Roll up
dt[, eval(q), by=list(x, y)]

   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>

I have a somewhat cheating method that might do the trick. Importantly, I assume that each x, y, list combination is unique! If not, please disregard.

I'm going to create two separate data.tables: the first is dcasted without the data list objects, and the second has only the unique data list objects and a key. Then just merge them together to get the desired result.
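Since the approach relies on the list value being the same for every row of an x, y group, it may be worth checking that first. A minimal sketch of such a check, on a made-up table (dt_check and n_distinct are illustrative names, not part of the original post):

```r
library(data.table)

# Hypothetical data: group A repeats the same list value, group B does not.
dt_check <- data.table(
  x = rep(c("A", "B"), each = 2),
  y = "d",
  data = list(list("a"), list("a"), list("b"), list("c"))
)

# Count the distinct list values per (x, y) group; base unique() works on lists.
# Any group with more than one distinct value breaks the uniqueness assumption.
violations <- dt_check[, .(n_distinct = length(unique(data))), by = .(x, y)][n_distinct > 1L]
print(violations)
```

If violations has zero rows, keeping one row per group loses nothing.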

require(data.table)
require(stringr)
require(reshape2)

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  # repeat the two list values explicitly instead of relying on recycling
  data=rep(list(list("a","b"), list("c","d")), 3),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
  )


# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist,x+y~variable,value.var="value")
                       ,key=c("x","y"))


# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]

# Rows are duplicated, so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there are list objects in the data.table.
# Instead I cheat by assigning each row within an x,y "group" a value
# that is unique within EACH group, but present within EVERY group.
# Then simply subselect based on that value.
# I've chosen rank(), but no doubt there are other options

dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]

# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))

# drop the rank since we no longer need it
dt_list[,rank:=NULL]

# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)
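As a possible simplification of the rank() step, here is a sketch that assumes a data.table version whose unique() method accepts a by argument. Restricting the comparison to x and y means the list column is never compared, only carried along. The sample data mirrors the table above; dt_list_alt is an illustrative name.

```r
library(data.table)

dt <- data.table(
  x = rep(LETTERS[1:3], each = 2),
  y = "d",
  data = rep(list(list("a", "b"), list("c", "d")), 3),
  variable = rep(c("var.1", "var.2"), 3),
  value = 1:6
)

# Keep the first row of every (x, y) group. Only x and y are compared,
# so the list column in `data` does not get in the way.
dt_list_alt <- unique(dt[, .(x, y, data)], by = c("x", "y"))
print(dt_list_alt)
```

This replaces the rank column, the subsetting on it, and its removal with a single call, under the same assumption that one row per group is enough.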

This old question piqued my curiosity, as data.table has been improved significantly since 2013.

However, even with data.table version 1.11.4,

dcast(dt, x + y + data ~ variable)

still returns an error:

Columns specified in formula can not be of type list
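The message points at list columns appearing in the formula. A small sketch for spotting such columns before building the formula, on a made-up table (dt_demo and list_cols are illustrative names, not from the original post):

```r
library(data.table)

dt_demo <- data.table(
  x = c("A", "A"),
  y = "d",
  data = list(list(), list()),
  variable = c("var.1", "var.2"),
  value = 1:2
)

# A data.table is a list of columns, so is.list() can be applied column-wise.
list_cols <- names(dt_demo)[vapply(dt_demo, is.list, logical(1))]
print(list_cols)
```

Columns reported here need to be left out of the dcast formula and handled separately.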

The workaround follows the general outline of jonsedar's answer:

  1. Reshape the non-list columns from long to wide format
  2. Aggregate the list column data grouped by x and y
  3. Join the two partial results on x and y

but uses features of the current data.table syntax, e.g., the on parameter:

dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]

   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>

The list column data is aggregated by taking the first element. This is in line with OP's code line

l$data = .SD[1,]$data

which also picks the first element.
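For readers who prefer to see the three steps separately, here is a sketch that splits the chained call into intermediate results. The names wide, lists and result are illustrative, and data[1L] is used as a base R way of keeping the first list element per group; it is a stand-in for the first() call above, not the answer's exact code.

```r
library(data.table)

dt <- data.table(
  x = rep(LETTERS[1:3], each = 2),
  y = "d",
  data = rep(list(list("a", "b")), 6),
  variable = rep(c("var.1", "var.2"), 3),
  value = 1:6
)

# Step 1: reshape the non-list columns from long to wide.
wide <- dcast(dt, x + y ~ variable, value.var = "value")

# Step 2: keep one list element per (x, y) group.
# data[1L] is a length-one list, so the result stays a list column.
lists <- dt[, .(data = data[1L]), by = .(x, y)]

# Step 3: join the two partial results on x and y.
result <- wide[lists, on = .(x, y)]
print(result)
```

Keeping the intermediate tables makes it easier to inspect each step on a large table before committing to the one-liner.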
