
Convert Mixed-Length Named List to data.frame

I have a list of the following format:

[[1]]
[[1]]$a
[1] 1

[[1]]$b
[1] 3

[[1]]$c
[1] 5

[[2]]       
[[2]]$c
[1] 2

[[2]]$a
[1] 3

There is a predefined list of possible "keys" (a, b, and c, in this case), and each element in the list (each "row") has values defined for one or more of these keys. I'm looking for a fast way to get from the list structure above to a data.frame, which in this case would look like the following:

  a  b c
1 1  3 5
2 3 NA 2
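For anyone who wants to reproduce the example, here is a minimal sketch that builds the two-row list printed above and writes out the target data.frame by hand. The names x and expected are illustrative choices, not part of the original question.

```r
# The two-row example from the question, built by hand.
# Note that the second row omits key b and lists c before a.
x <- list(
  list(a = 1, b = 3, c = 5),
  list(c = 2, a = 3)
)

# The desired result, written out directly for comparison.
expected <- data.frame(a = c(1, 3), b = c(3, NA), c = c(5, 2))

str(x)
```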

Any help would be appreciated!


Appendix

I'm dealing with a table that will have up to 50,000 rows and 3-6 columns, with most of the values specified. I'll be reading the table in from JSON and trying to get it into a data.frame quickly.
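As a rough sketch of the JSON step, assuming the jsonlite package is used for parsing (the question does not say which parser is involved, and the JSON string here is made up to mirror the example): parsing with simplifyVector = FALSE should give a nested list like the one above, while the default simplification may already return a data.frame for an array of records.

```r
library(jsonlite)  # assumption: jsonlite is installed; the question names no parser

# Hypothetical JSON payload mirroring the two-row example
json <- '[{"a":1,"b":3,"c":5},{"c":2,"a":3}]'

# Keep the nested list-of-lists structure described in the question
x <- fromJSON(json, simplifyVector = FALSE)

# Let jsonlite simplify the array of records directly
df <- fromJSON(json)
```

If the simplified result is acceptable, the manual conversion may not be needed at all, but that depends on how the real JSON is shaped.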

Here's some code to create a sample list at the scale I'll be working with:

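ids <- c("a", "b", "c")
createList <- function(approxSize = 100) {
    set.seed(1234)

    fifth <- round(approxSize / 5)

    # Repeat a fixed block of five rows; some rows omit keys or reorder them.
    # The result is called `out` to avoid shadowing base::list.
    out <- list()
    out[1:(fifth * 5)] <- rep(
        list(list(a = 1, b = 2, c = 3),
             list(a = 3, b = 4, c = 5),
             list(a = 7, c = 9),
             list(c = 6, a = 8, b = 3),
             list(b = 6)),
        fifth)

    out
}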

Just create a list with approxSize of 50,000 to test performance on a list of this size.

Here is a short answer, though I doubt it will be very fast.

> library(plyr)
> rbind.fill(lapply(x, as.data.frame))
  a  b c
1 1  3 5
2 3 NA 2

Here's my initial thought. It doesn't speed up your approach, but it does simplify the code considerably:

# makeDF <- function(List, Names) {
#     m <- t(sapply(List, function(X) unlist(X)[Names]))
#     as.data.frame(m)
# }

## vapply() is a bit faster than sapply()
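makeDF <- function(List, Names) {
    m <- t(vapply(List,
                  FUN = function(X) unlist(X)[Names],
                  FUN.VALUE = numeric(length(Names))))
    ## set the column names explicitly; otherwise they are taken from
    ## the first row, which may be missing some keys
    colnames(m) <- Names
    as.data.frame(m)
}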

## Test timing with a 50k-item list
ll <- createList(50000)
nms <- c("a", "b", "c")

system.time(makeDF(ll, nms))
# user  system elapsed 
# 0.47    0.00    0.47 
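The core of makeDF is the expression unlist(X)[Names]: indexing a named vector by a character vector returns the values in the requested order and NA for any name that is absent. A small sketch of that behaviour on a single row (the variable names row, keys, and vals are illustrative):

```r
row  <- list(c = 2, a = 3)    # key b is missing and the order is c, a
keys <- c("a", "b", "c")

# Name-based indexing reorders the values and fills the gap with NA
vals <- unlist(row)[keys]

unname(vals)
# [1]  3 NA  2
```

Because every row comes back with the same length and type, vapply() can stack the rows into a matrix without any per-row checks.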

If you know the possible values beforehand and you are dealing with large data, using data.table and set() may be fast:

library(data.table)

cc <- createList(50000)

system.time({
  # Pre-allocate one all-NA numeric column per known key
  nas <- rep.int(NA_real_, length(cc))
  DT <- setnames(as.data.table(replicate(length(ids), nas, simplify = FALSE)), ids)

  # Fill in each defined value by reference
  for (xx in seq_along(cc)) {
    .n <- names(cc[[xx]])
    for (j in .n) {
      set(DT, i = xx, j = j, value = cc[[xx]][[j]])
    }
  }
})


# user  system elapsed 
# 0.68    0.01    0.70 

Old (slow) solution, kept for posterity

full <- c('a','b', 'c')

system.time({
  for (xx in seq_along(cc)) {
    # any missing columns
    mm <- setdiff(full, names(cc[[xx]]))
    # convert every row, so that rows holding all keys in a different
    # order are reordered too (the original condition skipped them)
    cc[[xx]] <- as.data.table(cc[[xx]])
    if (length(mm)) {
      # if required add additional columns
      cc[[xx]][, (mm) := as.list(rep(NA_real_, length(mm)))]
    }
    # put columns in correct order
    setcolorder(cc[[xx]], full)
  }

  cdt <- rbindlist(cc)
})

#   user  system elapsed 
# 21.83    0.06   22.00 

This second solution has been left here to show how data.table can be used poorly.

I know this is an old question, but I just came across it, and it's excruciating not to see the simplest solution I'm aware of. So here it is (simply specify fill=TRUE in rbindlist):

library(data.table)
# named x rather than list, to avoid shadowing base::list
x <- list(list(a=1,b=3,c=5), list(c=2,a=3))
rbindlist(x, fill=TRUE)

#    a  b c
# 1: 1  3 5
# 2: 3 NA 2

I don't know if this is the fastest way, but I'd be willing to bet that it is competitive, given data.table's thoughtful design and extremely good performance on a lot of other tasks.

Well, I gave my first thought a try and the performance wasn't as bad as I had feared, but I'm sure there's still room for improvement (especially in the wasteful matrix -> data.frame conversion).

convertList <- function(myList, ids){
    # Compute the column index of each value, to handle missing or improperly
    # ordered list elements: a maps to 1, b to 2, and c to 3, so a row containing
    # a=_, c=_, b=_ would have a value of `1,3,2`
    idInd <- lapply(myList, function(x){match(names(x), ids)})

    # Calculate the row index of each value as if myList were unlisted. So if there were
    # two elements in the first row, 3 in the second, and 1 in the third, you'd see:
    # 1, 1, 2, 2, 2, 3
    rowInd <- inverse.rle(list(values=seq_along(myList), lengths=sapply(myList, length)))

    # Unlist the column indices into a plain integer vector
    idInd <- unlist(idInd)

    # Create a grid of addresses. The first column is the row address, the second is the col
    address <- cbind(rowInd, idInd)

    # Have to use a matrix because you can't assign into a data.frame
    # using an addressing table like the one above
    mat <- matrix(ncol=length(ids), nrow=length(myList))

    # Assign the values to the addresses in the matrix
    mat[address] <- unlist(myList)

    # Convert to data.frame
    df <- as.data.frame(mat)
    colnames(df) <- ids

    df
}
myList <- createList(50000)
ids <- letters[1:3]

system.time(df <- convertList(myList, ids))

It's taking about 0.29 seconds to convert the 50,000 rows on my laptop (Windows 7, Intel i7 M620 @ 2.67 GHz, 4GB RAM).
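The key step in convertList is assigning into a matrix through a two-column address matrix, where each row of the address is a (row, column) pair. A minimal sketch of that indexing on the two-row example from the question, with the addresses and values typed in by hand rather than computed:

```r
ids <- c("a", "b", "c")

# One (row, column) pair per value, in the order the values appear
address <- cbind(
  row = c(1, 1, 1, 2, 2),
  col = c(1, 2, 3, 3, 1)   # row 2 supplies c first, then a
)
values <- c(1, 3, 5, 2, 3)

# Start from an all-NA matrix so that unaddressed cells stay NA
mat <- matrix(NA_real_, nrow = 2, ncol = length(ids), dimnames = list(NULL, ids))
mat[address] <- values   # each value lands in the cell its address row names
mat
#      a  b c
# [1,] 1  3 5
# [2,] 3 NA 2
```

All values are placed in a single vectorised assignment, which is presumably why this approach avoids the cost of a per-row loop.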

Still very much interested in other answers!

In dplyr:

library(dplyr)
# as_data_frame() is deprecated in favour of as_tibble()
bind_rows(lapply(x, as_tibble))

# A tibble: 2 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     3     5
2     3    NA     2
