Extracting from Nested list to data frame
I will put the `dput` of what my list looks like at the bottom so that the question is reproducible. The `dput` is of `a`, not `x`.
I have a big nested list called `x` that I'm trying to build a data frame from, but I cannot figure it out.
I have done the first part:
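a <- list()  # holds the experiences of each result
for (i in 1:3) a[[i]] <- x$results[[i]]$experiences
indx <- lengths(a)  # number of experiences per result
# pad each element to the longest length, then stack the elements as rows
zz <- as.data.frame(do.call(rbind, lapply(a, `length<-`, max(indx))))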
For this I used the following answer: Converting nested list (unequal length) to data frame
This leaves me with a data.frame with n columns for results, where n is the maximum number of results for any i:
v1 v2 v3
1 NULL NULL NULL
2 * * *
3 NULL NULL NULL
Each * is another nested list in the format `list(experience = list(duration = ...`
For example, take the first * in row 2, column v1. I don't want the whole list. I only want:
a[[2]][[1]]$experience$start
or, in terms of the original list `x`:
x$results[[2]]$experiences[[1]]$experience$start
I feel like I'm nearly there and only need some tweaks. I tried:
for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
indx <- lengths(a)
for(y in 1:length(a[[i]])) aa <- rbind(aa,tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(NA)))
zz <- as.data.frame(do.call(rbind,lapply(aa, `length<-`, max(indx))))}
Resulting in:
v1 v2 v3
1 NA NA NA
2 NA NA NA
3 2014 NA NA
4 2012 NA NA
5 2006 NA NA
6 NA NA NA
7 NA NA NA
I tried `cbind` instead of `rbind` on the final line, and that put all the dates in the first row.
I also tried the following:
for(i in 1:3){a[[i]]<-lengths(x$results[[i]]$experiences)
indx <- lengths(a)
for(y in 1:length(indx)){tt[i] <- tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(""))}
zz <- as.data.frame(do.call(rbind,lapply(tt, `length<-`, max(indx))))}
This came close: it builds the right format but only returns the first result:
v1 v2 v3
1 NA NA NA
2 2014 NA NA
3 NA NA NA
The format I want is:
V1 V2 V3
1 NA NA NA
2 2014 2012 2006
3 NA NA NA
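The target layout above can be sketched in base R without growing objects inside a loop. This is only an illustration under assumptions: `results` is a small made-up stand-in for `x$results` (the real data is not reproduced here), and a missing `start` is filled with `NA`.

```r
# Toy stand-in for x$results: three results, only the second has experiences.
results <- list(
  list(experiences = list()),
  list(experiences = list(
    list(experience = list(duration = "1", start = "2014")),
    list(experience = list(start = "2012", duration = "2")),
    list(experience = list(duration = "3", start = "2006"))
  )),
  list(experiences = list())
)

# One character vector of start values per result; a missing start becomes NA.
starts <- lapply(results, function(r) {
  vapply(r$experiences, function(e) {
    s <- e$experience$start
    if (is.null(s)) NA_character_ else as.character(s)
  }, character(1))
})

# Pad every vector to the longest length, then stack the vectors as rows.
n_max <- max(1L, lengths(starts))
zz <- as.data.frame(
  do.call(rbind, lapply(starts, `length<-`, n_max)),
  stringsAsFactors = FALSE
)
names(zz) <- paste0("V", seq_len(n_max))
print(zz)
```

Because each value is looked up by name (`$experience$start`), the order of the fields inside an experience does not matter here.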
(Sample data now at bottom)
Newest attempt:
Doing the following returns only the first start date from each `a[[i]]`. In the second loop I need to make the list `aa[i][y]` something different.
for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
for(y in 1:length(a[[i]])){aa[i][y] = if(is.null(a[[i]][[y]]$experience$start)){"NULL"}else{a[[i]][[y]]$experience$start}}}
So for `dput2` I'd like the form:
v1 v2 v3 v4 v5 v6 v7 v8
1 2015
2 2011 2007 null null null null null null
3 2016 2015 2015 2015 2013 2010
I don't mind if the blanks are NULL or NA.
UPDATE
The answer below almost works. However, in my data the structure changes: the order of the names (roleName, duration, etc.) changes, which breaks the answer because `cumsum` is used to determine when a new list starts. If you have `duration` followed by `start`, the keys are `9` and `1`, and the `cumsum` part labels them as two different lists.
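A small reproduction of that failure mode may make it easier to see. The lookup table `key_of` and the helper `record_id` are hypothetical names used only for this sketch, and the key values are illustrative; the boundary rule (a new record starts whenever the key decreases) is the one used in the answer.

```r
# Keys assigned to field names via a lookup table (values are illustrative).
key_of <- c(start = 1, end = 2, roleName = 3, duration = 9)

# Boundary rule from the answer: a new record starts when the key decreases.
record_id <- function(fields) {
  k <- unname(key_of[fields])
  cumsum(k < c(0, k[-length(k)])) + 1
}

# Two records whose fields appear in lookup order: split correctly.
ok_ids <- record_id(c("start", "end", "roleName", "start", "end", "roleName"))

# One record in which duration comes before start: wrongly split in two.
bad_ids <- record_id(c("duration", "start", "end", "roleName"))

print(ok_ids)
print(bad_ids)
```

In the second call all four fields belong to one experience, yet `duration` ends up in group 1 and the rest in group 2, because the drop from 9 to 1 looks like the start of a new record.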
I wrote the following:
my.list <- list(structure(
list(
experience = structure(
list(
start = "1",
end = "1",
roleName = "a",
summary = "a",
duration = "a",
current = "a",
org = structure(list(name = "a", url = "a"), .Names = c("name","url")),
location = structure(
list(
displayLocation = NULL,
lat = NULL,
lng = NULL
),
.Names = c("displayLocation",
"lat", "lng")
) ),.Names = c("start", "end", "roleName", "summary", "duration", "current", "org", "location")),
`_meta` = structure(
list(weight = 1L, `_sources` = list(structure(
list(`_origin` = "a"), .Names = "_origin"
))),.Names = c("weight", "_sources"))),.Names = c("experience", "_meta")))
Then:
aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]),
function(i){a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]}), error=function(e) print(list()))})
This changes the structure such that `key2` will always be in the right order.
However, after this loop I found another issue. Sometimes the experience list contains, for example, nothing but a roleName. If that occurs twice in a row, the keys are repeated, and `cumsum` treats them as the same experience instead of separate ones.
This means I cannot create `df3` because of duplicate identifiers for rows. And even if I could, by removing the troublesome rows, the names wouldn't match, because `i` in the solution below matches the names using the sequence, and removing any rows changes the lengths.
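One way around both problems, sketched here as an assumption and not as the method from the answer, is to record the result index and the experience index explicitly while walking the list, instead of inferring record boundaries from key order after `unlist`. The toy list `a` and the `fields` vector below are made up for illustration.

```r
# Toy stand-in for a: result 2 has one full record and two roleName-only records.
a <- list(
  list(),
  list(
    list(experience = list(duration = "1", start = "2014", roleName = "a")),
    list(experience = list(roleName = "a")),
    list(experience = list(roleName = "a"))
  )
)

fields <- c("start", "end", "roleName", "duration")

# One row per experience; ids come from list positions, not from key order.
rows <- list()
for (i in seq_along(a)) {
  for (j in seq_along(a[[i]])) {
    e <- a[[i]][[j]]$experience
    vals <- lapply(fields, function(f) {
      if (is.null(e[[f]])) NA_character_ else as.character(e[[f]])
    })
    names(vals) <- fields
    rows[[length(rows) + 1]] <- data.frame(result = i, experience = j, vals,
                                           stringsAsFactors = FALSE)
  }
}
long <- do.call(rbind, rows)
print(long)
```

The two consecutive roleName-only experiences stay on separate rows because `(result, experience)` is unique by construction, and `seq_along` simply skips empty results, so no `tryCatch` is needed.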
Here is my complete code for more insight:
for(i in 1:x$count){a[[i]]<-x$results[[i]]$experiences}
aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]),
function(i){a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]}), error=function(e) print(list()))})
aaa <- unlist(aa)
dummydf <- data.frame(b=c("start", "end", "roleName", "summary",
"duration", "current", "org.name", "org.url"), key=1:8)
df <- data.frame(a=aaa, b=names(aaa))
df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x){
x %>% select(-c(key, key2)) %>% spread(b, a)
}) %>% data.table::rbindlist(fill=TRUE) %>% t
df3 <- data.frame(df3)
i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)
df4 <- data.frame(t(df3))
df4$dates <- as.Date(NA)
df4$dates <- as.Date(df4$start)
df4 <- data.frame(dates = df4$dates)
df4 <- t(df4)
df4 <- data.frame(df4)
names(df4) <- paste0(names(df4), "_", i)
df4[] <- lapply(df4[], as.character)
l1 <- lapply(split(stack(df4), sub('.*_', '', stack(df4)[,2])), '[', 1)
df5 <- t(do.call(cbindPad, l1))
df5 <- data.frame(df5)
`cbindPad` is taken from this question.
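The definition of `cbindPad` is not reproduced in this post. For readers who want to run the code, the following is a hypothetical stand-in (named `cbind_pad` to make clear it is not the original) that assumes the helper column-binds inputs of unequal length and pads the shorter ones with `NA`. The original may differ in its details.

```r
# Hypothetical stand-in for cbindPad: column-bind inputs of unequal length,
# padding the shorter ones with NA at the bottom.
cbind_pad <- function(...) {
  cols <- lapply(list(...), function(v) {
    if (is.data.frame(v)) v[[1]] else v  # use the first column of a data frame
  })
  n_max <- max(lengths(cols))
  do.call(cbind, lapply(cols, `length<-`, n_max))
}

padded <- cbind_pad(`1` = c("2015"),
                    `2` = c("2011", "2007"),
                    `3` = c("2016", "2015", "2015"))
print(t(padded))  # one row per input, as in the df5 step
```

It accepts single-column data frames as well as plain vectors, since `l1` in the code above is a list of one-column data frames passed through `do.call`.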
New sample data including the issues:
dput3 =
list(list(), list(
structure(list(experience = structure(list(
duration = "1", start = "2014",
end = "3000", roleName = "a",
summary = "aaa",
org = structure(list(name = "a"), .Names = "name"),
location = structure(list(displayLocation = NULL, lat = NULL,
lng = NULL), .Names = c("displayLocation", "lat", "lng"
))), .Names = c("duration", "start", "end", "roleName", "summary",
"org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight",
"_sources"))), .Names = c("experience", "_meta")),
structure(list(
experience = structure(list(end = "3000",
start = "2012", duration = "2",
roleName = "a", summary = "aaa",
org = structure(list(name = "None"), .Names = "name"),
location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName",
"summary", "org", "location")), `_meta` = structure(list(
weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")),
structure(list(
experience = structure(list(duration = "3",
start = "2006", end = "3000",
roleName = "a", summary = "aaa", org = structure(list(name = " "), .Names = "name"),
location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName",
"summary", "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight",
"_sources"))), .Names = c("experience", "_meta")),
structure(list(
experience = structure(list(roleName = "a",
location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName",
"location")), `_meta` = structure(list(
weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")),
structure(list(
experience = structure(list(roleName = "a",
location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName",
"location")), `_meta` = structure(list(
weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta"))
),
list(
structure(list(experience = structure(list(
duration = "1", start = "2014",
end = "3000", roleName = "a",
summary = "aaa",
org = structure(list(name = "a"), .Names = "name"),
location = structure(list(displayLocation = NULL, lat = NULL,
lng = NULL), .Names = c("displayLocation", "lat", "lng"
))), .Names = c("duration", "start", "end", "roleName", "summary",
"org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight",
"_sources"))), .Names = c("experience", "_meta"))))
Maybe this can help
library(dplyr)
library(tidyr)
a <- unlist(a)
df <- data.frame(a=a, b=names(a)) %>% mutate(key=cumsum(b=="experience.duration")) %>%
split(.$key) %>% lapply(function(x) x %>% select(-key) %>% spread(b, a)) %>%
do.call(rbind, .) %>% t %>% data.frame
df$key <- rownames(df)
Then you can filter on the rows of interest.
The above would be equivalent to
rbind(unlist(a)[1:8], unlist(a)[9:16],unlist(a)[17:24]) %>% t
Try this for `dput2`:
a <- unlist(dput2)
library(dplyr)
library(tidyr)
dummydf <- data.frame(b=c("experience.start", "experience.end", "experience.roleName", "experience.summary",
"experience.org", "experience.org.name", "experience.org.url",
"_meta.weight", "_meta._sources._origin", "experience.duration"), key=1:10)
df <- data.frame(a=a, b=names(a))
df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x){
x %>% select(-c(key, key2)) %>% spread(b, a)
}) %>% data.table::rbindlist(fill=TRUE) %>% t
df3 <- data.frame(df3)
i <- sapply(seq_along(dput2), function(y) rep(y, sapply(dput2, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)
View(df3)
Managed to figure something out, using `dput3` above:
a <- dput3
aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]),
function(i){if(is.null(a[[y]][[i]]$experience$start)){"Null"}else{a[[y]][[i]]$experience$start}}),error=function(e) print(list()))})
test <- NULL  # accumulator for the (key, key2) index pairs; assumed to start empty
for(i in 1:length(aa)){for(y in 1:length(aa[[i]])){tryCatch(for(z in length(aa[[i]][[y]]))
{test <- rbind(test, data.frame(key = i, key2= y))},error=function(e) print(0))}}
aaa <- unlist(aa)
df <- data.frame(a=aaa)
df2 <- cbind(df, test)
i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist
df5 <- data.frame(dates = df2$a)
df5 <- t(df5)
df5 <- data.frame(df5)
names(df5) <- paste0(names(df5), "_", i)
df5[] <- lapply(df5[], as.character)
l1 <- lapply(split(stack(df5), as.numeric(sub('.*_', '', stack(df5)[,2]))), '[', 1)
df6 <- t(do.call(cbindPad, l1))
df6 <- data.frame(df6)
I will try to expand it so it works with more than one field (currently, in `aa`, I isolate `start`).