
Extracting from Nested list to data frame

I will put the dput of my list at the bottom so that the question is reproducible. The dput is of a, not x.

I have a big nested list called x that I'm trying to build a data frame from, but I cannot figure it out.

I have done the first part:

for (i in 1:3) {
  a[[i]] <- x$results[[i]]$experiences
  indx <- lengths(a)
  zz <- as.data.frame(do.call(rbind, lapply(a, `length<-`, max(indx))))
}

For this I used the following answer: Converting nested list (unequal length) to data frame

This leaves me with a data.frame with n columns of results, where n is the maximum number of results for any i:

  v1   v2   v3
1 NULL NULL NULL
2  *    *    *
3 NULL NULL NULL

Each * is another nested list in the format list(experience = list(duration = ...

For example, take the first * in row 2, column v1. I don't want the whole list. I only want:

a[[2]][[1]]$experience$start

or, in terms of the original list x:

x$results[[2]]$experiences[[1]]$experience$start
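
For readers unfamiliar with the padding trick used in the first part, here is a minimal sketch of how `length<-` evens out unequal lengths before rbind. It uses a hypothetical toy list of character vectors (a_toy) rather than the real nested a, so the values are illustrative only.

```r
# Toy stand-in for `a`: three elements of unequal length (hypothetical data).
a_toy <- list(c("x"), c("p", "q", "r"), character(0))

# lengths() gives the length of each element; max() is the target width.
indx <- lengths(a_toy)

# `length<-` extends each vector to the target width, filling with NA.
padded <- lapply(a_toy, `length<-`, max(indx))

# Stack the padded vectors as rows of a data frame.
zz_toy <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(zz_toy)
```

When the elements are lists rather than atomic vectors, as in the real data, the padded cells are NULL instead of NA.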

I feel like I'm nearly there with some tweaks. I tried:

for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
indx <- lengths(a)
for(y in 1:length(a[[i]])) aa <- rbind(aa,tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(NA)))
zz <- as.data.frame(do.call(rbind,lapply(aa, `length<-`, max(indx))))}

Resulting in:

  v1     v2     v3
1  NA     NA     NA
2  NA     NA     NA
3 2014    NA     NA
4 2012    NA     NA
5 2006    NA     NA
6  NA     NA     NA
7  NA     NA     NA 

I tried cbind instead of rbind on the final line, and that put all the dates in the first row.

I also tried the following:

for(i in 1:3){a[[i]]<-lengths(x$results[[i]]$experiences)
  indx <- lengths(a)
for(y in 1:length(indx)){tt[i] <- tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(""))}
zz <- as.data.frame(do.call(rbind,lapply(tt, `length<-`, max(indx))))}

This came close: it builds the right format but only returns the first result:

  v1   v2  v3
1 NA   NA  NA
2 2014 NA  NA
3 NA   NA  NA

The format I want is:

 V1  V2  V3
1 NA  NA  NA
2 2014 2012 2006
3 NA  NA  NA

(Sample data now at bottom)

Newest attempt:

The following returns only the first start date from each a[[i]]; in the second loop I need to make the list aa[i][y] something different.

 for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
 for(y in 1:length(a[[i]])){aa[i][y] = if(is.null(a[[i]][[y]]$experience$start)){"NULL"}else{a[[i]][[y]]$experience$start}}}

So for dput2 I'd like the form:

  v1    v2  v3   v4   v5   v6   v7   v8
1 2015
2 2011 2007 null null null null null null
3 2016 2015 2015 2015 2013 2010

I don't mind if the blanks are NULL or NA.

UPDATE

The answer below almost works. However, in my data the structure changes: the order of the names (roleName, duration, etc.) changes, which ruins the answer because cumsum is used to determine when a new list starts. If duration comes before start, the keys are 9 and 1, and the cumsum part labels them as two different lists.
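
To illustrate the failure, here is a minimal sketch of the grouping expression on its own. The key vectors are made up for the example (they are not taken from the real data), and group_id is a hypothetical helper name wrapping the same cumsum expression.

```r
# Same expression as the key2 computation: start a new group whenever
# the key is smaller than the previous key.
group_id <- function(key) {
  cumsum(key < c(0, key[-length(key)])) + 1
}

# Two records whose fields arrive in the expected order (made-up keys).
ordered <- c(1, 2, 3, 9, 1, 2, 3, 9)

# One record where the field with key 9 comes first.
reordered <- c(9, 1, 2, 3)

print(group_id(ordered))    # records are separated correctly
print(group_id(reordered))  # the single record is split into two groups
```

The drop from 9 to 1 inside a single record looks exactly like the start of a new record, which is why the grouping depends on a fixed field order.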

I wrote the following:

my.list <- list(structure(
  list(
    experience = structure(
      list(
        start = "1",
        end = "1",
        roleName = "a",
        summary = "a",
        duration = "a",
        current = "a",
        org = structure(list(name = "a", url = "a"), .Names = c("name","url")),
        location = structure(
          list(
            displayLocation = NULL,
            lat = NULL,
            lng = NULL
          ),
          .Names = c("displayLocation",
                     "lat", "lng")
        ) ),.Names = c("start", "end", "roleName", "summary", "duration", "current", "org", "location")),
    `_meta` = structure(
      list(weight = 1L, `_sources` = list(structure(
        list(`_origin` = "a"), .Names = "_origin"
      ))),.Names = c("weight", "_sources"))),.Names = c("experience", "_meta")))

Then:

# Reorder every experience to the field order of the template my.list
aa <- lapply(1:length(a), function(y) {
  tryCatch(
    lapply(1:length(a[[y]]), function(i) {
      a[[y]][[i]]$experience[names(my.list[[1]]$experience)]
    }),
    error = function(e) print(list())
  )
})

This changes the structure such that key2 will always be in the right order.
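
As a small illustration of that reordering step, indexing a list by a fixed vector of names returns its elements in that order. The field_order vector and the two toy experiences below are assumptions made up for the example.

```r
# Fixed field order to impose on every record (hypothetical subset of fields).
field_order <- c("start", "end", "roleName", "duration")

# A record whose fields arrive in a different order.
e <- list(duration = "2", end = "3000", start = "2012", roleName = "a")

# Indexing by name returns the elements in the requested order.
e_sorted <- e[field_order]
print(names(e_sorted))

# A record that lacks some fields: missing names come back as NULL
# entries whose name is NA, so they need handling downstream.
partial <- list(roleName = "a")[c("start", "roleName")]
print(names(partial))
```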

However, after this loop I found I have another issue.

Sometimes I have, for example, nothing but a roleName in the experience list. If that occurs twice in a row, the keys are repeated, and cumsum treats them as the same experience instead of separate ones.

This means I cannot create df3 because of duplicate identifiers for rows. Even if I could, by removing the troublesome rows, the names wouldn't match: i in the solution below matches the names using the sequence, and removing any rows changes the lengths.
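
One possible way around the repeated-key problem, sketched here under assumptions, is to take the record id from the list structure itself instead of inferring it from key order. The recs list is made-up toy data, and this is an alternative idea rather than the approach used in the answer below.

```r
# Toy experiences: two consecutive records with only roleName,
# followed by a fuller record (hypothetical data).
recs <- list(
  list(roleName = "a"),
  list(roleName = "a"),
  list(start = "2014", roleName = "a")
)

# Flatten each record separately so its boundaries are preserved.
flat <- lapply(recs, unlist)

# One id per record, repeated once for every field that record has.
record_id <- rep(seq_along(flat), lengths(flat))

long <- data.frame(
  record = record_id,
  field  = names(unlist(flat)),
  value  = unlist(flat),
  stringsAsFactors = FALSE
)
print(long)
```

Because the id is derived before flattening, two identical single-field records still get distinct ids.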

Here is my full code for more insight:

for(i in 1:x$count){a[[i]]<-x$results[[i]]$experiences}

  # Reorder every experience to the field order of the template my.list
  aa <- lapply(1:length(a), function(y) {
    tryCatch(
      lapply(1:length(a[[y]]), function(i) {
        a[[y]][[i]]$experience[names(my.list[[1]]$experience)]
      }),
      error = function(e) print(list())
    )
  })

  aaa <- unlist(aa)
  dummydf <- data.frame(b=c("start", "end", "roleName", "summary", 
                            "duration", "current", "org.name",  "org.url"), key=1:8)

  df <- data.frame(a=aaa, b=names(aaa))
  df2 <- left_join(df, dummydf)
  df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)

  df_split <- split(df2, df2$key2)
  df3 <- lapply(df_split, function(x){
    x %>% select(-c(key, key2)) %>% spread(b, a)
  }) %>% data.table::rbindlist(fill=TRUE) %>% t
  df3 <- data.frame(df3)
  i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist
  names(df3) <- paste0(names(df3), "_", i)
  df4 <- data.frame(t(df3))
  df4$dates <- as.Date(NA)
  df4$dates <- as.Date(df4$start)
  df4 <- data.frame(dates = df4$dates)
  df4 <- t(df4)
  df4 <- data.frame(df4)
  names(df4) <- paste0(names(df4), "_", i)
  df4[] <- lapply(df4[], as.character)
  l1 <- lapply(split(stack(df4), sub('.*_', '', stack(df4)[,2])), '[', 1)
  df5 <- t(do.call(cbindPad, l1))
  df5 <- data.frame(df5)

cbindPad is taken from this question
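
Since the definition of cbindPad is not reproduced here, the following is a hypothetical stand-in (named cbind_pad_sketch to make clear it is not the original function). It assumes each argument is a one-column data frame or a vector, and pads the shorter ones with NA before binding them as columns.

```r
# Hypothetical stand-in for cbindPad: bind inputs of unequal length
# as columns, padding the shorter ones with NA.
cbind_pad_sketch <- function(...) {
  # Flatten every input to a plain character vector.
  cols <- lapply(list(...), function(d) as.character(unlist(d, use.names = FALSE)))
  # Target height is the longest input.
  n <- max(lengths(cols), 0L)
  # Extend each vector to the target height (new cells are NA).
  padded <- lapply(cols, function(v) {
    length(v) <- n
    v
  })
  do.call(cbind, padded)
}

m <- cbind_pad_sketch(
  data.frame(values = c("2014", "2012")),
  data.frame(values = "2015")
)
print(m)
```

Transposing the result with t() then gives one row per input, as in the last lines of the code above.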

New sample code including the issues:

dput3 = 
list(list(), list(
structure(list(experience = structure(list(
  duration = "1", start = "2014", 
  end = "3000", roleName = "a", 
  summary = "aaa", 
  org = structure(list(name = "a"), .Names = "name"), 
  location = structure(list(displayLocation = NULL, lat = NULL, 
    lng = NULL), .Names = c("displayLocation", "lat", "lng"
    ))), .Names = c("duration", "start", "end", "roleName", "summary", 
    "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
      structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
      "_sources"))), .Names = c("experience", "_meta")), 
structure(list(
        experience = structure(list(end = "3000", 
        start = "2012", duration = "2", 
        roleName = "a", summary = "aaa", 
        org = structure(list(name = "None"), .Names = "name"), 
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName", 
        "summary", "org", "location")), `_meta` = structure(list(
          weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")), 
  structure(list(
            experience = structure(list(duration = "3", 
            start = "2006", end = "3000", 
            roleName = "a", summary = "aaa", org = structure(list(name = " "), .Names = "name"), 
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName",
            "summary", "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
            "_sources"))), .Names = c("experience", "_meta")),
  structure(list(
            experience = structure(list(roleName = "a",  
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName", 
           "location")), `_meta` = structure(list(
            weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")),
structure(list(
            experience = structure(list(roleName = "a",  
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName", 
            "location")), `_meta` = structure(list(
            weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta"))
            ), 
            list(
structure(list(experience = structure(list(
              duration = "1", start = "2014", 
              end = "3000", roleName = "a", 
              summary = "aaa", 
              org = structure(list(name = "a"), .Names = "name"), 
              location = structure(list(displayLocation = NULL, lat = NULL, 
                lng = NULL), .Names = c("displayLocation", "lat", "lng"
                ))), .Names = c("duration", "start", "end", "roleName", "summary", 
                "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
                  structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
                  "_sources"))), .Names = c("experience", "_meta"))))

Maybe this can help

library(dplyr)
library(tidyr)

a <- unlist(a)

df <- data.frame(a=a, b=names(a)) %>% mutate(key=cumsum(b=="experience.duration")) %>% 
      split(.$key) %>% lapply(function(x) x %>% select(-key) %>% spread(b, a)) %>% 
      do.call(rbind, .) %>% t %>% data.frame

df$key <- rownames(df)

Then you can filter on the rows of interest

The above would be equivalent to

rbind(unlist(a)[1:8], unlist(a)[9:16],unlist(a)[17:24]) %>% t
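
Filtering on the rows of interest could look roughly like this. The small df_toy below is a made-up stand-in with the same shape as df (one column per experience, one row per field, plus a key column), so the column names X1 and X2 are assumptions.

```r
# Toy version of the transposed data frame: fields as rows,
# experiences as columns (hypothetical values).
df_toy <- data.frame(
  X1 = c("1", "2014", "3000"),
  X2 = c("2", "2012", "3000"),
  stringsAsFactors = FALSE
)
rownames(df_toy) <- c("experience.duration", "experience.start", "experience.end")
df_toy$key <- rownames(df_toy)

# Keep only the row holding the start dates, dropping the key column.
starts <- df_toy[df_toy$key == "experience.start", c("X1", "X2")]
print(starts)
```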

Update

Try this for dput2

a <- unlist(dput2)

library(dplyr)
library(tidyr)

dummydf <- data.frame(b=c("experience.start", "experience.end", "experience.roleName", "experience.summary", 
                      "experience.org", "experience.org.name",  "experience.org.url", 
                      "_meta.weight", "_meta._sources._origin", "experience.duration"), key=1:10)


df <- data.frame(a=a, b=names(a))

df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x){
       x %>% select(-c(key, key2)) %>% spread(b, a)
       }) %>% data.table::rbindlist(fill=TRUE) %>% t

df3 <- data.frame(df3)
i <- sapply(seq_along(dput2), function(y) rep(y, sapply(dput2, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)

View(df3)

Managed to figure something out, using dput3 above:

a <- dput3

aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]), 
  function(i){if(is.null(a[[y]][[i]]$experience$start)){"Null"}else{a[[y]][[i]]$experience$start}}),error=function(e) print(list()))})


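# Accumulator for (result index, experience index) pairs
test <- data.frame()
for (i in 1:length(aa)) {
  for (y in 1:length(aa[[i]])) {
    # Empty results raise a subscript error, which is caught and skipped
    tryCatch(
      for (z in length(aa[[i]][[y]])) {
        test <- rbind(test, data.frame(key = i, key2 = y))
      },
      error = function(e) print(0)
    )
  }
}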

aaa <- unlist(aa)
df <- data.frame(a=aaa)
df2 <- cbind(df, test)
i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist

df5 <- data.frame(dates = df2$a)
df5 <- t(df5)
df5 <- data.frame(df5)
names(df5) <- paste0(names(df5), "_", i)
df5[] <- lapply(df5[], as.character)
l1 <- lapply(split(stack(df5), as.numeric(sub('.*_', '', stack(df5)[,2]))), '[', 1)
df6 <- t(do.call(cbindPad, l1))
df6 <- data.frame(df6)

Will try to expand it so it works with more than one vertical (currently, in aa, I isolate start).
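
A possible direction for that expansion, sketched under assumptions: wrap the extraction in a function that takes the field name, so the same code produces one padded table per field. The helper name extract_field and the toy list a_demo are hypothetical, and the sketch assumes each field holds a single scalar value (so nested fields such as org or location are not covered).

```r
# Build a padded data frame of one field across all results.
# Rows are results, columns are experiences, missing values are NA.
extract_field <- function(a, field) {
  per_result <- lapply(a, function(exps) {
    vapply(exps, function(e) {
      v <- e$experience[[field]]
      if (is.null(v)) NA_character_ else as.character(v)
    }, character(1))
  })
  # Pad every result to the widest one (at least one column).
  n <- max(lengths(per_result), 1L)
  padded <- lapply(per_result, function(v) {
    length(v) <- n
    v
  })
  as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
}

# Toy data shaped like dput3: an empty result, a result with two
# experiences (one lacking start), and a result with one experience.
a_demo <- list(
  list(),
  list(
    list(experience = list(duration = "1", start = "2014", roleName = "a")),
    list(experience = list(roleName = "a"))
  ),
  list(
    list(experience = list(start = "2015", roleName = "b"))
  )
)

fields <- c("start", "roleName")
tables <- setNames(lapply(fields, function(f) extract_field(a_demo, f)), fields)
print(tables$start)
print(tables$roleName)
```

Because the position of each value comes from the list structure, this does not depend on field order or on cumsum-based grouping.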
