Flatten a list with complex nested structure

I have a list with the following example structure:

> dput(test)
structure(list(id = 1, var1 = 2, var3 = 4, section1 = structure(list(
    var1 = 1, var2 = 2, var3 = 3), .Names = c("var1", "var2", 
"var3")), section2 = structure(list(row = structure(list(var1 = 1, 
    var2 = 2, var3 = 3), .Names = c("var1", "var2", "var3")), 
    row = structure(list(var1 = 4, var2 = 5, var3 = 6), .Names = c("var1", 
    "var2", "var3")), row = structure(list(var1 = 7, var2 = 8, 
        var3 = 9), .Names = c("var1", "var2", "var3"))), .Names = c("row", 
"row", "row"))), .Names = c("id", "var1", "var3", "section1", 
"section2"))


> str(test)
List of 5
 $ id      : num 1
 $ var1    : num 2
 $ var3    : num 4
 $ section1:List of 3
  ..$ var1: num 1
  ..$ var2: num 2
  ..$ var3: num 3
 $ section2:List of 3
  ..$ row:List of 3
  .. ..$ var1: num 1
  .. ..$ var2: num 2
  .. ..$ var3: num 3
  ..$ row:List of 3
  .. ..$ var1: num 4
  .. ..$ var2: num 5
  .. ..$ var3: num 6
  ..$ row:List of 3
  .. ..$ var1: num 7
  .. ..$ var2: num 8
  .. ..$ var3: num 9

Notice that the section2 list contains elements named row. These represent multiple records. What I have is a nested list in which some elements sit at the root level and others are multiple nested records for the same observation. I would like the following output in data.frame format:

> desired
  id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
1  1    2    4             1             2             3             1             4             7
2 NA   NA   NA            NA            NA            NA             2             5             8
3 NA   NA   NA            NA            NA            NA             3             6             9

Root-level elements should populate the first row, while row elements should have their own rows. As an added complication, the number of variables in the row entries can vary.

Here's a general approach. It doesn't assume that you'll have only three rows; it will work with however many rows you have. And if a value is missing in the nested structure (e.g. var1 doesn't exist for some sub-lists in section2), the code correctly returns an NA for that cell.

E.g. if we use the following data:

test <- structure(list(id = 1, var1 = 2, var3 = 4, section1 = structure(list(var1 = 1, var2 = 2, var3 = 3), .Names = c("var1", "var2", "var3")), section2 = structure(list(row = structure(list(var1 = 1, var2 = 2), .Names = c("var1", "var2")), row = structure(list(var1 = 4, var2 = 5), .Names = c("var1", "var2")), row = structure(list( var2 = 8, var3 = 9), .Names = c("var2", "var3"))), .Names = c("row", "row", "row"))), .Names = c("id", "var1", "var3", "section1", "section2"))

The general approach is to use melt to create a data frame that includes information about the nested structure, and then dcast to mold it into the format you desire.

library("reshape2")

flat <- unlist(test, recursive=FALSE)
is_row <- grep("row", names(flat))  ## positions of the 'row' entries
names(flat)[is_row] <- gsub("row", "var", paste0(names(flat)[is_row], seq_along(is_row)))  ## keeps track of rows by adding an ID
ul <- melt(unlist(flat))
split <- strsplit(rownames(ul), split=".", fixed=TRUE) ## splits the names into component parts
max <- max(unlist(lapply(split, FUN=length)))
pad <- function(a) {
  c(a, rep(NA, max-length(a)))
}
levels <- matrix(unlist(lapply(split, FUN=pad)), ncol=max, byrow=TRUE)

## Get the nesting structure
nested <- data.frame(levels, ul)
nested$X3[is.na(nested$X3)] <- levels(as.factor(nested$X3))[[1]]
desired <- dcast(nested, X3~X1 + X2)
names(desired) <- gsub("_", "\\.", gsub("_NA", "", names(desired)))
desired <- desired[,names(flat)]

> desired
##   id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
## 1  1    2    4             1             2             3             1             4             7
## 2 NA   NA   NA            NA            NA            NA             2             5             8
## 3 NA   NA   NA            NA            NA            NA             3             6             9
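
The first line of the code above relies on how `unlist(..., recursive = FALSE)` removes exactly one level of nesting and joins parent and child names with a dot. A minimal base-R sketch on a small hypothetical list `x` (not the `test` object from the question) shows the names this produces, and that the `row` entries are still lists afterwards:

```r
# Hypothetical miniature of the question's structure (names are illustrative).
x <- list(
  id = 1,
  section1 = list(var1 = 1, var2 = 2),
  section2 = list(row = list(var1 = 1), row = list(var1 = 4))
)

# Remove one level of nesting only; names become "parent.child".
flat1 <- unlist(x, recursive = FALSE)

names(flat1)
# "id" "section1.var1" "section1.var2" "section2.row" "section2.row"

# The two "section2.row" entries are still lists, which is why the answer
# renames them with a running index before flattening a second time.
sapply(flat1, is.list)
```

Because both records end up with the identical name `section2.row`, some kind of running index is needed before they can be told apart in a reshaped table.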

The central idea of this solution is to flatten all sub-lists except those named 'row'. This can be done by creating a unique ID for each list element (stored in z) and then requiring that all elements within a single 'row' share the same ID (stored in z2; this takes a recursive function to traverse the nested list). Then z2 can be used to group the elements that belong to the same row. The resulting list can be converted into matrix form using stri_list2matrix from the stringi package, and then into a data frame.

utest <- unlist(test)
z <- relist(seq_along(utest),test)

recurse <- function(L) {
    if (!is.list(L)) return(L)  # leaf value: nothing to traverse
    b <- names(L)=='row'
    L.b <- lapply(L[b],function(k) relist(rep(k[[1]],length(k)),k))
    L.nb <- lapply(L[!b],recurse)
    c(L.b,L.nb)
}

z2 <- unlist(recurse(z))

library(stringi)
desired <- as.data.frame(stri_list2matrix(split(utest,z2)))
names(desired) <- names(z2)[unique(z2)]

desired
#     id var1 var3 section1.var1 section1.var2 section1.var3 section2.row.var1
# 1    1    2    4             1             2             3                 1
# 2 <NA> <NA> <NA>          <NA>          <NA>          <NA>                 2
# 3 <NA> <NA> <NA>          <NA>          <NA>          <NA>                 3
#   section2.row.var1 section2.row.var1
# 1                 4                 7
# 2                 5                 8
# 3                 6                 9
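
The two building blocks of this answer, `relist()` for stamping IDs onto the original shape and `split()` for grouping by those IDs, can be seen on a smaller hypothetical list `x`. This is a sketch for illustration only, with the shared IDs written out by hand rather than computed by `recurse`:

```r
# Hypothetical nested list: one root value and two 'row' records.
x <- list(id = 1,
          row = list(var1 = 10, var2 = 20),
          row = list(var1 = 30, var2 = 40))

vals <- unlist(x)                          # flat named vector of the five values
ids  <- utils::relist(seq_along(vals), x)  # same shape as x, leaves replaced by 1..5
str(ids)

# Give every element of a 'row' the ID of that row's first element
# (the job recurse() does in the answer; written by hand here).
shared <- c(1, 2, 2, 4, 4)

# One group per root value or per row.
groups <- split(unname(vals), shared)
groups
# group "1" holds 1, group "2" holds 10 20, group "4" holds 30 40
```

Each group then becomes one column of the final table, which is why the root-level values only fill the first row.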

Your problem is not well defined when rows have complex structures (i.e. if each row in test contained the list test, how should the rows be bound together? And what if rows in the same table have different structures?), so the following solution depends on each row being a list of values.

That said, I'm guessing that in the general case your list test will contain either values, lists of values, or lists of rows (where rows are lists of values). Also, this solution still works if the rows aren't always called "row".

temp <- lapply(test,
                function(x){
                    if(!is.list(x))
                        # x is a value
                        return(x)
                    # x is a list of rows or values
                    out <- do.call(cbind,x)
                    if(nrow(out)>1){
                        # x is a list of rows 
                        colnames(out)<-paste0(colnames(out),'.',rownames(out))
                        rownames(out)<-rep_len(NA,nrow(out))
                    }
                    return(out)
                })

# a function that extends a matrix to a fixed number of rows (N)
# by appending rows of NAs
rowExtend  <-  function(x,N){
                 if((!is.matrix(x)) ){
                     out<-do.call(rbind,c(list(x),as.list(rep_len(NA,N - 1))))
                     colnames(out) <- ""
                     out
                 }else if(nrow(x) < N)
                     do.call(rbind,c(list(x),as.list(rep_len(NA,N - nrow(x)))))
                 else
                     x
             }

# calculate the maximum number of rows
.nrows <- sapply(temp,nrow)
.nrows <- max(unlist(.nrows[!sapply(.nrows,is.null)]))

# extend the shorter rows
(temp2<-lapply(temp, rowExtend,.nrows))

# calculate new column names
newColNames <- mapply(function(x,y) {
                       if(nzchar(y)[1L])
                           paste0(x,'.',y)
                       else x
                        },
                       names(temp2),
                       lapply(temp2,colnames))


do.call(cbind,mapply(`colnames<-`,temp2,newColNames))

#> id var1 var3 section1.var1 section1.var2 section1.var3 section2.row.var1 section2.row.var2 section2.row.var3
#> 1  2    4    1             2             3             1                 4                 7                
#> NA NA   NA   NA            NA            NA            2                 5                 8                
#> NA NA   NA   NA            NA            NA            3                 6                 9                
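
The NA-padding step that `rowExtend` performs can also be written more compactly. The following is a minimal base-R sketch, assuming every input is either a single value or a matrix; `pad_rows` is a hypothetical helper name, not part of the answer above:

```r
# Hypothetical helper: pad a value or matrix with NA rows up to n rows.
pad_rows <- function(m, n) {
  m <- as.matrix(m)                        # a single value becomes a 1 x 1 matrix
  if (nrow(m) >= n) return(m)              # already long enough
  filler <- matrix(NA, nrow = n - nrow(m), ncol = ncol(m))
  rbind(m, filler)                         # column names of m are kept
}

one_row <- cbind(var1 = 1, var2 = 2)       # 1 x 2 matrix, like section1 above
padded  <- pad_rows(one_row, 3)
padded                                     # first row holds 1 and 2, the rest is NA
dim(pad_rows(5, 3))                        # a scalar padded to 3 rows, 1 column
```

Building the filler as a matrix with the right number of columns avoids relying on `rbind` recycling a bare `NA` across columns.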

This starts similarly to tiffany's answer, but diverges a bit afterwards.

library(data.table)

# flatten the first level
flat = unlist(test, recursive = FALSE)

# compute max length
N = max(sapply(flat, length))

# pad NA's and convert to data.table (at this point it will *look* like the right answer)
dt = as.data.table(lapply(flat, function(l) c(l, rep(NA, N - length(l)))))

# but in reality some of the columns are lists - check by running sapply(dt, class)
# so unlist them
dt = dt[, lapply(.SD, unlist)]
#   id var1 var3 section1.var1 section1.var2 section1.var3 section2.row section2.row section2.row
#1:  1    2    4             1             2             3            1            4            7
#2: NA   NA   NA            NA            NA            NA            2            5            8
#3: NA   NA   NA            NA            NA            NA            3            6            9
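
Note that padding with `c(l, rep(NA, N - length(l)))` fills by position, so it does not line values up by variable name when the `row` entries have different variables, as the question says they can. One possible way to align by name, sketched in base R on hypothetical `rows` data and not taken from any of the answers, is to index each record by the union of all variable names:

```r
# Hypothetical records with differing variables (mirrors the ragged example data).
rows <- list(
  list(var1 = 1, var2 = 2),
  list(var1 = 4, var2 = 5),
  list(var2 = 8, var3 = 9)
)

# Union of all variable names, in order of first appearance.
vars <- unique(unlist(lapply(rows, names)))

# Indexing a named vector by a name it lacks yields NA, which aligns by name.
aligned <- sapply(rows, function(r) unname(unlist(r)[vars]))
rownames(aligned) <- vars
colnames(aligned) <- paste0("section2.var", seq_along(rows))  # naming is an assumption
aligned
```

Here the third record gets an NA for var1 instead of having 8 and 9 shifted up into the first two positions.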
