
Extract and align specific elements from deeply nested list into R dataframe

I have a deeply nested list gathered from a JSON file via fromJSON(). Here is a minimal example showing the nesting, with only 2 entries:


conditions <- list(
  list(
    PMID = 00001,
    Phrases = list(
      list(
        PhraseText = "Hodgkin Lymphoma",
        Mappings = mappings1 <- list(
          list(
            MappingScore = 1000,
            MappingCandidates = mc1 <- list(
              list(CandidateScore = 1000,
                   CandidateCUI = "C075655",
                   CandidateMatched = "Hodgkins Lymphoma",
                   CandidatePreferred = "Hodgkins Lymphoma",
                   MatchedWords = list(c("hodgkin", "lymphoma"))),
              list(CandidateScore = 850,
                   CandidateCUI = "C095659",
                   CandidateMatched = "Lymphoma",
                   CandidatePreferred = "Lymphoma",
                   MatchedWords = list(c("lymphoma"))))
          )
        )
      )
    )
  ),
  list(
    PMID = 00002,
    Phrases = list(
      list(
        PhraseText = "Plaque Psoriasis",
        Mappings = mappings2 <- list(
          list(MappingScore = 1000,
               MappingCandidates = mc2 <- list(
                 list(CandidateScore = 1000,
                      CandidateCUI = "C0125609",
                      CandidateMatched = "Plaque Psoriasis",
                      CandidatePreferred = "Plaque Psoriasis",
                      MatchedWords = list(c("plaque", "psoriasis"))),
                 list(CandidateScore = 750,
                      CandidateCUI = "C0320011",
                      CandidateMatched = "Psoriasis",
                      CandidatePreferred = "Psoriasis",
                      MatchedWords = list(c("psoriasis")))))
        )
      )
    )
  )
)


Some of these levels are actually data frames, but I can't seem to recreate this in the code without ruining the structure. I am trying to extract specific elements from multiple levels of the nested list, and ideally get output like this (or something similar):


output <- data.frame(
  PhraseText = c("Hodgkin Lymphoma", "Hodgkin Lymphoma", "Plaque Psoriasis", "Plaque Psoriasis"),
  MappingScore = c(1000, 1000, 1000, 1000),
  CandidateScore = c(1000, 850, 1000, 750),
  CandidateCUI = c("C075655", "C095659", "C0125609", "C0320011"),
  CandidatePreferred = c("Hodgkins Lymphoma", "Lymphoma", "Plaque Psoriasis", "Psoriasis")
)


I have tried several iterations of lapply, map, and hoist, but looping through the unnamed portions of the list (i.e. MappingCandidates[[1]] and MappingCandidates[[2]]) is throwing me, and I can't seem to get the deepest elements (e.g. CandidateCUI) back up the chain and associated with the top-level elements (PhraseText).


x <- lapply(conditions, function(i) {
  lapply(i[["Phrases"]][[1]][["Mappings"]], function(j) {
    lapply(j[["MappingCandidates"]], function(k) {
      k[c("CandidateScore", "CandidateCUI", "CandidatePreferred")]
    })
  })
})


Using tidyr, we can unnest the list by combining a series of calls to unnest_wider() and unnest_longer():

library(tidyr)

tibble(conditions) |>
  unnest_wider(conditions) |>
  unnest_longer(Phrases) |>
  unnest_wider(Phrases) |>
  unnest_longer(Mappings) |>
  unnest_wider(Mappings) |>
  unnest_longer(MappingCandidates) |>
  unnest_wider(MappingCandidates) |>
  unnest_longer(MatchedWords)
#> # A tibble: 4 × 8
#>    PMID PhraseText       MappingScore CandidateScore CandidateCUI CandidateMatched  CandidatePreferred MatchedWords
#>   <dbl> <chr>                   <dbl>          <dbl> <chr>        <chr>             <chr>              <list>      
#> 1     1 Hodgkin Lymphoma         1000           1000 C075655      Hodgkins Lymphoma Hodgkins Lymphoma  <chr [2]>   
#> 2     1 Hodgkin Lymphoma         1000            850 C095659      Lymphoma          Lymphoma           <chr [1]>   
#> 3     2 Plaque Psoriasis         1000           1000 C0125609     Plaque Psoriasis  Plaque Psoriasis   <chr [2]>   
#> 4     2 Plaque Psoriasis         1000            750 C0320011     Psoriasis         Psoriasis          <chr [1]>
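If adding a package is not an option, the same idea can be sketched in base R: loop over every level and build one row per candidate at the deepest level, so the parent fields (PhraseText, MappingScore) are still in scope when the row is created. This is only a rough sketch, not part of the answer above. The `conditions` list here is an abbreviated stand-in for the one in the question, and the helpers `cand()` and `phrase()` are hypothetical names used only to keep the example short.

```r
# Abbreviated stand-in for the question's `conditions` list (assumption:
# same nesting; CandidateMatched and MatchedWords omitted for brevity).
# cand() and phrase() are hypothetical helpers, not part of any package.
cand <- function(score, cui, preferred) {
  list(CandidateScore = score, CandidateCUI = cui, CandidatePreferred = preferred)
}
phrase <- function(text, candidates) {
  list(PhraseText = text,
       Mappings = list(list(MappingScore = 1000, MappingCandidates = candidates)))
}
conditions <- list(
  list(PMID = 1, Phrases = list(phrase("Hodgkin Lymphoma", list(
    cand(1000, "C075655", "Hodgkins Lymphoma"),
    cand(850, "C095659", "Lymphoma"))))),
  list(PMID = 2, Phrases = list(phrase("Plaque Psoriasis", list(
    cand(1000, "C0125609", "Plaque Psoriasis"),
    cand(750, "C0320011", "Psoriasis")))))
)

# Walk every level; create the row at the deepest level so that the parent
# values (ph, mp) are still available alongside each candidate (cd).
rows <- list()
for (cond in conditions) {
  for (ph in cond$Phrases) {
    for (mp in ph$Mappings) {
      for (cd in mp$MappingCandidates) {
        rows[[length(rows) + 1]] <- data.frame(
          PhraseText = ph$PhraseText,
          MappingScore = mp$MappingScore,
          CandidateScore = cd$CandidateScore,
          CandidateCUI = cd$CandidateCUI,
          CandidatePreferred = cd$CandidatePreferred,
          stringsAsFactors = FALSE
        )
      }
    }
  }
}

# Stack the one-row data frames into a single data frame.
output <- do.call(rbind, rows)
output
```

Growing a list inside nested loops is fine for a few thousand records; for much larger inputs the tidyr pipeline is likely to be the more convenient choice.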

Another approach (perhaps easier to generalize) uses rrapply() from the rrapply package. Here rrapply() is called twice with the option how = "bind": once to bind together all repeated MappingCandidates and once to bind the other nodes (PMID, Phrases, PhraseText, MappingScore):

library(rrapply)

## bind MappingCandidates
candidateNodes <- rrapply(
  conditions, 
  how = "bind", 
  options = list(namecols = TRUE, coldepth = 8)
)
candidateNodes 
#>   L1      L2 L3       L4 L5                L6 L7 CandidateScore CandidateCUI  CandidateMatched CandidatePreferred    MatchedWords.1
#> 1  1 Phrases  1 Mappings  1 MappingCandidates  1           1000      C075655 Hodgkins Lymphoma  Hodgkins Lymphoma hodgkin, lymphoma
#> 2  1 Phrases  1 Mappings  1 MappingCandidates  2            850      C095659          Lymphoma           Lymphoma          lymphoma
#> 3  2 Phrases  1 Mappings  1 MappingCandidates  1           1000     C0125609  Plaque Psoriasis   Plaque Psoriasis plaque, psoriasis
#> 4  2 Phrases  1 Mappings  1 MappingCandidates  2            750     C0320011         Psoriasis          Psoriasis         psoriasis

## bind other nodes
otherNodes <- rrapply(
  conditions, 
  condition = \(x, .xparents) !"MappingCandidates" %in% .xparents, 
  how = "bind", 
  options = list(namecols = TRUE)
)
otherNodes
#>   L1 PMID Phrases.1.PhraseText Phrases.1.Mappings.1.MappingScore
#> 1  1    1     Hodgkin Lymphoma                              1000
#> 2  2    2     Plaque Psoriasis                              1000

## merge into single data.frame
allNodes <- merge(candidateNodes, otherNodes, by = "L1")
allNodes
#>   L1      L2 L3       L4 L5                L6 L7 CandidateScore CandidateCUI  CandidateMatched CandidatePreferred    MatchedWords.1 PMID Phrases.1.PhraseText Phrases.1.Mappings.1.MappingScore
#> 1  1 Phrases  1 Mappings  1 MappingCandidates  1           1000      C075655 Hodgkins Lymphoma  Hodgkins Lymphoma hodgkin, lymphoma    1     Hodgkin Lymphoma                              1000
#> 2  1 Phrases  1 Mappings  1 MappingCandidates  2            850      C095659          Lymphoma           Lymphoma          lymphoma    1     Hodgkin Lymphoma                              1000
#> 3  2 Phrases  1 Mappings  1 MappingCandidates  1           1000     C0125609  Plaque Psoriasis   Plaque Psoriasis plaque, psoriasis    2     Plaque Psoriasis                              1000
#> 4  2 Phrases  1 Mappings  1 MappingCandidates  2            750     C0320011         Psoriasis          Psoriasis         psoriasis    2     Plaque Psoriasis                              1000
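To get from the merged result to the layout requested in the question, the position columns (L1 to L7) can be dropped and the remaining columns renamed. The sketch below uses only base R. The `allNodes` data frame in it is a hand-built stand-in that assumes the column names printed above, and `keep` is a hypothetical lookup vector mapping the desired names to the existing ones.

```r
# Stand-in for the merged `allNodes` (assumption: column names as printed
# above; only the columns needed for the final result are reproduced).
allNodes <- data.frame(
  L1 = c(1, 1, 2, 2),
  CandidateScore = c(1000, 850, 1000, 750),
  CandidateCUI = c("C075655", "C095659", "C0125609", "C0320011"),
  CandidatePreferred = c("Hodgkins Lymphoma", "Lymphoma", "Plaque Psoriasis", "Psoriasis"),
  Phrases.1.PhraseText = c("Hodgkin Lymphoma", "Hodgkin Lymphoma",
                           "Plaque Psoriasis", "Plaque Psoriasis"),
  Phrases.1.Mappings.1.MappingScore = c(1000, 1000, 1000, 1000),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: new column name = existing column name.
keep <- c(
  PhraseText = "Phrases.1.PhraseText",
  MappingScore = "Phrases.1.Mappings.1.MappingScore",
  CandidateScore = "CandidateScore",
  CandidateCUI = "CandidateCUI",
  CandidatePreferred = "CandidatePreferred"
)

# Select the wanted columns in order, then apply the shorter names.
output <- setNames(allNodes[, unname(keep)], names(keep))
output
```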
