Convert Mixed-Length named List to data.frame
I have a list of the following format:
[[1]]
[[1]]$a
[1] 1
[[1]]$b
[1] 3
[[1]]$c
[1] 5
[[2]]
[[2]]$c
[1] 2
[[2]]$a
[1] 3
There is a predefined list of possible "keys" (a, b, and c, in this case) and each element in the list ("row") will have values defined for one or more of these keys.
I'm looking for a fast way to get from the list structure above to a data.frame which would look like the following, in this case:
  a  b c
1 1  3 5
2 3 NA 2
Any help would be appreciated!
Appendix
I'm dealing with a table that will have up to 50,000 rows and 3-6 columns, with most of the values specified. I'll be taking the table in from JSON and trying to quickly get it into data.frame structure.
Here's some code to create a sample list of the scale with which I'll be working:
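For anyone who wants to reproduce the printed structure, a minimal sketch follows. The variable name x is an assumption (the question never names the list); it is chosen only because several answers refer to an object called x.

```r
# Rebuild the two-"row" list printed in the question.
# `x` is an assumed name; the key order in the second element (c, then a)
# deliberately differs from the first, as in the printed output.
x <- list(
  list(a = 1, b = 3, c = 5),
  list(c = 2, a = 3)
)
str(x)
```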
ids <- c("a", "b", "c")
createList <- function(approxSize=100){
  set.seed(1234)
  fifth <- round(approxSize/5)
  list <- list()
  list[1:(fifth*5)] <- rep(
    list(list(a=1, b=2, c=3),
         list(a=3, b=4, c=5),
         list(a=7, c=9),
         list(c=6, a=8, b=3),
         list(b=6)),
    fifth)
  list
}
Just create a list with approxSize of 50,000 to test the performance on a list of this size.
Here is a short answer; I doubt it will be very fast, though.
> library(plyr)
> rbind.fill(lapply(x, as.data.frame))
  a  b c
1 1  3 5
2 3 NA 2
Here's my initial thought. It doesn't speed up your approach, but it does simplify the code considerably:
# makeDF <- function(List, Names) {
#     m <- t(sapply(List, function(X) unlist(X)[Names]))
#     as.data.frame(m)
# }
## vapply() is a bit faster than sapply()
makeDF <- function(List, Names) {
    m <- t(vapply(List,
                  FUN = function(X) unlist(X)[Names],
                  FUN.VALUE = numeric(length(Names))))
    as.data.frame(m)
}
## Test timing with a 50k-item list
ll <- createList(50000)
nms <- c("a", "b", "c")
system.time(makeDF(ll, nms))
# user system elapsed
# 0.47 0.00 0.47
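The approach above relies on how R indexes a named vector by name: a key that is absent comes back as NA. The sketch below illustrates that on a tiny input. The objects vals, rows, m and df are hypothetical names introduced here, and the explicit colnames() call is my own precaution rather than part of the answer: as far as I understand vapply(), it takes result names from the first call when FUN.VALUE is unnamed, so an incomplete first row could otherwise leave the columns mislabelled.

```r
nms <- c("a", "b", "c")

# Indexing a named vector by a name it lacks yields NA in that slot
vals <- unlist(list(c = 2, a = 3))[nms]
unname(vals)  # 3 NA 2

# Hypothetical input whose first "row" is incomplete
rows <- list(list(b = 6), list(a = 1, b = 2, c = 3))
m <- t(vapply(rows, function(r) unlist(r)[nms], numeric(length(nms))))

# Set the column names explicitly instead of relying on the first row
colnames(m) <- nms
df <- as.data.frame(m)
df
```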
If you know the possible values beforehand, and you are dealing with large data, perhaps using data.table and set will be fast:
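library(data.table)
cc <- createList(50000)
system.time({
  # pre-allocate one NA-filled numeric column per key
  nas <- rep.int(NA_real_, length(cc))
  DT <- setnames(as.data.table(replicate(length(ids), nas, simplify = FALSE)), ids)
  # fill in the defined values by reference, one cell at a time
  for(xx in seq_along(cc)){
    .n <- names(cc[[xx]])
    for(j in .n){
      set(DT, i = xx, j = j, value = cc[[xx]][[j]])
    }
  }
})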
# user system elapsed
# 0.68 0.01 0.70
full <- c('a','b', 'c')
system.time({
for(xx in seq_along(cc)) {
mm <- setdiff(full, names(cc[[xx]]))
if(length(mm) || all(names(cc[[xx]]) == full)){
cc[[xx]] <- as.data.table(cc[[xx]])
# any missing columns
if(length(mm)){
# if required add additional columns
cc[[xx]][, (mm) := as.list(rep(NA_real_, length(mm)))]
}
# put columns in correct order
setcolorder(cc[[xx]], full)
}
}
cdt <- rbindlist(cc)
})
# user system elapsed
# 21.83 0.06 22.00
This second solution has been left here to show how data.table can be used poorly.
I know this is an old question, but I just came across it and it's excruciating not to see the simplest solution I'm aware of. So here it is (simply specify 'fill=TRUE' in rbindlist):
library(data.table)
list = list(list(a=1,b=3,c=5),list(c=2,a=3))
rbindlist(list,fill=TRUE)
#    a  b c
# 1: 1  3 5
# 2: 3 NA 2
I don't know if this is the fastest way, but I'd be willing to bet that it competes, given data.table's thoughtful design and extremely good performance on a lot of other tasks.
Well, I gave my first thought a try and the performance wasn't as bad as I was afraid of, but I'm sure there's still room for improvement (especially in the wasteful matrix -> data.frame conversion).
convertList <- function(myList, ids){
  # This computes a list of the numerical index for each value to handle the missing/
  # improperly ordered list elements. So it will have a list in which each element
  # associated with A has a value of 1, B -> 2, and C -> 3. So a row containing
  # A=_, C=_, B=_ would have a value of `1,3,2`
  idInd <- lapply(myList, function(x){match(names(x), ids)})
  # Calculate the row indices if I were to unlist myList. So if there were two elements
  # in the first row, 3 in the second, and 1 in the third, you'd see: 1, 1, 2, 2, 2, 3
  rowInd <- inverse.rle(list(values=1:length(myList), lengths=sapply(myList, length)))
  # Unlist the first list created to just be a numerical vector
  idInd <- unlist(idInd)
  # Create a grid of addresses. The first column is the row address, the second is the col
  address <- cbind(rowInd, idInd)
  # Have to use a matrix because you can't assign a data.frame
  # using an addressing table like we have above
  mat <- matrix(ncol=length(ids), nrow=length(myList))
  # Assign the values to the addresses in the matrix
  mat[address] <- unlist(myList)
  # Convert to data.frame
  df <- as.data.frame(mat)
  colnames(df) <- ids
  df
}
myList <- createList(50000)
ids <- letters[1:3]
system.time(df <- convertList(myList, ids))
It's taking about 0.29 seconds to convert the 50,000 rows on my laptop (Windows 7, Intel i7 M620 @ 2.67 GHz, 4GB RAM).
Still very much interested in other answers!
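The core trick in convertList is assigning into a matrix through a two-column (row, column) address matrix. A minimal sketch on the two-row example from the question may make that easier to follow; the names x, row_ind and col_ind are assumptions introduced for illustration, and rep() with lengths() is swapped in for the inverse.rle() call, which should produce the same row indices.

```r
# Two-"row" example from the question (object names are illustrative)
x   <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
ids <- c("a", "b", "c")

# One row index per stored value: 1 1 1 2 2
row_ind <- rep(seq_along(x), lengths(x))
# Column index of each value's key within ids: 1 2 3 3 1
col_ind <- match(unlist(lapply(x, names)), ids)

# Start from an all-NA matrix, then fill only the addressed cells
mat <- matrix(NA_real_, nrow = length(x), ncol = length(ids),
              dimnames = list(NULL, ids))
mat[cbind(row_ind, col_ind)] <- unlist(x)

df <- as.data.frame(mat)
df
```

Cells that are never addressed keep their initial NA, which is how the missing b in the second row ends up as NA without any per-row branching.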
In dplyr:
library(dplyr)
# as_data_frame() is deprecated in recent tibble releases; as_tibble() is its replacement
bind_rows(lapply(x, as_tibble))
# A tibble: 2 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     3     5
2     3    NA     2