Extract and align specific elements from deeply nested list into R dataframe
I have a deeply nested list gathered from a JSON file via `fromJSON()`. Here is a minimal example showing the nesting, with only 2 entries:
```r
conditions <- list(
  list(
    PMID = 00001,
    Phrases = list(
      list(
        PhraseText = "Hodgkin Lymphoma",
        Mappings = list(
          list(
            MappingScore = 1000,
            MappingCandidates = list(
              list(CandidateScore = 1000,
                   CandidateCUI = "C075655",
                   CandidateMatched = "Hodgkins Lymphoma",
                   CandidatePreferred = "Hodgkins Lymphoma",
                   MatchedWords = list(c("hodgkin", "lymphoma"))),
              list(CandidateScore = 850,
                   CandidateCUI = "C095659",
                   CandidateMatched = "Lymphoma",
                   CandidatePreferred = "Lymphoma",
                   MatchedWords = list(c("lymphoma")))
            )
          )
        )
      )
    )
  ),
  list(
    PMID = 00002,
    Phrases = list(
      list(
        PhraseText = "Plaque Psoriasis",
        Mappings = list(
          list(
            MappingScore = 1000,
            MappingCandidates = list(
              list(CandidateScore = 1000,
                   CandidateCUI = "C0125609",
                   CandidateMatched = "Plaque Psoriasis",
                   CandidatePreferred = "Plaque Psoriasis",
                   MatchedWords = list(c("plaque", "psoriasis"))),
              list(CandidateScore = 750,
                   CandidateCUI = "C0320011",
                   CandidateMatched = "Psoriasis",
                   CandidatePreferred = "Psoriasis",
                   MatchedWords = list(c("psoriasis")))
            )
          )
        )
      )
    )
  )
)
```
Some of these levels are actually data frames, but I can't seem to recreate that in the example code without ruining the structure. I am trying to extract specific elements from multiple levels of the nested list, and ideally get output like this (or something similar):
```r
output <- data.frame(
  PhraseText = c("Hodgkin Lymphoma", "Hodgkin Lymphoma", "Plaque Psoriasis", "Plaque Psoriasis"),
  MappingScore = c(1000, 1000, 1000, 1000),
  CandidateScore = c(1000, 850, 1000, 750),
  CandidateCUI = c("C075655", "C095659", "C0125609", "C0320011"),
  CandidatePreferred = c("Hodgkins Lymphoma", "Lymphoma", "Plaque Psoriasis", "Psoriasis")
)
```
I have tried several iterations of `lapply`, `map`, and `hoist`, but looping through the unnamed portions of the list (i.e. `MappingCandidates[[1]]` and `MappingCandidates[[2]]`) is throwing me, and I can't seem to get the deepest elements (i.e. `CandidateCUI`) back up the chain and associated with the top-level elements (`PhraseText`).
```r
x <- lapply(conditions, function(i) {
  lapply(i[["Phrases"]][[1]][["Mappings"]], function(j) {
    lapply(j[["MappingCandidates"]], function(k) {
      k[c("CandidateScore", "CandidateCUI", "CandidatePreferred")]
    })
  })
})
```
Using `tidyr`, we can unnest the list by combining a bunch of calls to `unnest_wider()` and `unnest_longer()`:
```r
library(tidyr)
tibble(conditions) |>
  unnest_wider(conditions) |>
  unnest_longer(Phrases) |>
  unnest_wider(Phrases) |>
  unnest_longer(Mappings) |>
  unnest_wider(Mappings) |>
  unnest_longer(MappingCandidates) |>
  unnest_wider(MappingCandidates) |>
  unnest_longer(MatchedWords)
#> # A tibble: 4 × 8
#> PMID PhraseText MappingScore CandidateScore CandidateCUI CandidateMatched CandidatePreferred MatchedWords
#> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <list>
#> 1 1 Hodgkin Lymphoma 1000 1000 C075655 Hodgkins Lymphoma Hodgkins Lymphoma <chr [2]>
#> 2 1 Hodgkin Lymphoma 1000 850 C095659 Lymphoma Lymphoma <chr [1]>
#> 3 2 Plaque Psoriasis 1000 1000 C0125609 Plaque Psoriasis Plaque Psoriasis <chr [2]>
#> 4 2 Plaque Psoriasis 1000 750 C0320011 Psoriasis Psoriasis <chr [1]>
```
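If only the columns from the desired output are needed, the unnested tibble can be subset by name at the end. The following is a minimal sketch, assuming a cut-down input so that it runs on its own: `conditions_small` and `wanted` are hypothetical names, and the toy record omits `MatchedWords`, so the final `unnest_longer(MatchedWords)` step is not needed here.

```r
library(tidyr)

# Hypothetical one-record stand-in for `conditions` (no MatchedWords)
conditions_small <- list(
  list(
    PMID = 1,
    Phrases = list(
      list(
        PhraseText = "Hodgkin Lymphoma",
        Mappings = list(
          list(
            MappingScore = 1000,
            MappingCandidates = list(
              list(CandidateScore = 1000,
                   CandidateCUI = "C075655",
                   CandidatePreferred = "Hodgkins Lymphoma"),
              list(CandidateScore = 850,
                   CandidateCUI = "C095659",
                   CandidatePreferred = "Lymphoma")
            )
          )
        )
      )
    )
  )
)

# Columns requested in the question
wanted <- c("PhraseText", "MappingScore", "CandidateScore",
            "CandidateCUI", "CandidatePreferred")

# Same alternating unnest_wider()/unnest_longer() pattern as above
flat <- tibble(conditions = conditions_small) |>
  unnest_wider(conditions) |>
  unnest_longer(Phrases) |>
  unnest_wider(Phrases) |>
  unnest_longer(Mappings) |>
  unnest_wider(Mappings) |>
  unnest_longer(MappingCandidates) |>
  unnest_wider(MappingCandidates)

# Keep only the requested columns and convert to a plain data.frame
output <- as.data.frame(flat[, wanted])
print(output)
```

Subsetting by a character vector keeps the column order under your control, which is convenient if the JSON gains extra fields later.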
Another approach (perhaps easier to generalize) uses `rrapply()` from the `rrapply` package. Here `rrapply()` is called twice with the option `how = "bind"`: once to bind together all repeated `MappingCandidates`, and once to bind the other nodes (`PMID`, `Phrases`, `PhraseText`, `MappingScore`):
```r
library(rrapply)
## bind MappingCandidates
candidateNodes <- rrapply(
  conditions,
  how = "bind",
  options = list(namecols = TRUE, coldepth = 8)
)
candidateNodes
#> L1 L2 L3 L4 L5 L6 L7 CandidateScore CandidateCUI CandidateMatched CandidatePreferred MatchedWords.1
#> 1 1 Phrases 1 Mappings 1 MappingCandidates 1 1000 C075655 Hodgkins Lymphoma Hodgkins Lymphoma hodgkin, lymphoma
#> 2 1 Phrases 1 Mappings 1 MappingCandidates 2 850 C095659 Lymphoma Lymphoma lymphoma
#> 3 2 Phrases 1 Mappings 1 MappingCandidates 1 1000 C0125609 Plaque Psoriasis Plaque Psoriasis plaque, psoriasis
#> 4 2 Phrases 1 Mappings 1 MappingCandidates 2 750 C0320011 Psoriasis Psoriasis psoriasis
## bind other nodes
otherNodes <- rrapply(
  conditions,
  condition = \(x, .xparents) !"MappingCandidates" %in% .xparents,
  how = "bind",
  options = list(namecols = TRUE)
)
otherNodes
#> L1 PMID Phrases.1.PhraseText Phrases.1.Mappings.1.MappingScore
#> 1 1 1 Hodgkin Lymphoma 1000
#> 2 2 2 Plaque Psoriasis 1000
## merge into single data.frame
allNodes <- merge(candidateNodes, otherNodes, by = "L1")
allNodes
#> L1 L2 L3 L4 L5 L6 L7 CandidateScore CandidateCUI CandidateMatched CandidatePreferred MatchedWords.1 PMID Phrases.1.PhraseText Phrases.1.Mappings.1.MappingScore
#> 1 1 Phrases 1 Mappings 1 MappingCandidates 1 1000 C075655 Hodgkins Lymphoma Hodgkins Lymphoma hodgkin, lymphoma 1 Hodgkin Lymphoma 1000
#> 2 1 Phrases 1 Mappings 1 MappingCandidates 2 850 C095659 Lymphoma Lymphoma lymphoma 1 Hodgkin Lymphoma 1000
#> 3 2 Phrases 1 Mappings 1 MappingCandidates 1 1000 C0125609 Plaque Psoriasis Plaque Psoriasis plaque, psoriasis 2 Plaque Psoriasis 1000
#> 4 2 Phrases 1 Mappings 1 MappingCandidates 2 750 C0320011 Psoriasis Psoriasis psoriasis 2 Plaque Psoriasis 1000
```
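For comparison, the nested-iteration idea from the question can also be made to work in base R. The parent values (`PhraseText`, `MappingScore`) stay in scope inside the inner loops, so each candidate row can be built with them already attached. This is a rough sketch rather than a tuned implementation: `flatten_conditions()` and `conditions_small` are hypothetical names, and it assumes every level is a plain list rather than a data frame. If jsonlite is the package in use, that would correspond to reading the file with `fromJSON(..., simplifyVector = FALSE)`.

```r
# Hypothetical helper: walk the nesting with explicit loops and emit
# one data.frame row per candidate, carrying parent fields along.
flatten_conditions <- function(conditions) {
  rows <- list()
  for (record in conditions) {
    for (phrase in record[["Phrases"]]) {
      for (mapping in phrase[["Mappings"]]) {
        for (candidate in mapping[["MappingCandidates"]]) {
          rows[[length(rows) + 1]] <- data.frame(
            PhraseText         = phrase[["PhraseText"]],
            MappingScore       = mapping[["MappingScore"]],
            CandidateScore     = candidate[["CandidateScore"]],
            CandidateCUI       = candidate[["CandidateCUI"]],
            CandidatePreferred = candidate[["CandidatePreferred"]],
            stringsAsFactors   = FALSE
          )
        }
      }
    }
  }
  # Stack the one-row data frames into a single data.frame
  do.call(rbind, rows)
}

# Hypothetical one-record stand-in for `conditions`
conditions_small <- list(
  list(
    PMID = 1,
    Phrases = list(
      list(
        PhraseText = "Hodgkin Lymphoma",
        Mappings = list(
          list(
            MappingScore = 1000,
            MappingCandidates = list(
              list(CandidateScore = 1000,
                   CandidateCUI = "C075655",
                   CandidatePreferred = "Hodgkins Lymphoma"),
              list(CandidateScore = 850,
                   CandidateCUI = "C095659",
                   CandidatePreferred = "Lymphoma")
            )
          )
        )
      )
    )
  )
)

out <- flatten_conditions(conditions_small)
print(out)
```

Growing a list inside a loop is fine for a few thousand candidates; for much larger inputs the `tidyr` or `rrapply` approaches are likely the better fit.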