Denormalize/coerce list (with nested vectors) to data.frame in R

I'm reading a YAML file like

- person_id: 111
  person_name: Russell
  time:
  - 1
  - 2
  - 3
  value:
  - a
  - b
  - c
- person_id: 222
  person_name: Steven
  time:
  - 1
  - 2
  value:
  - d
  - e

that I want to denormalize to:

  person_id person_name time value
1       111     Russell    1     a
2       111     Russell    2     b
3       111     Russell    3     c
4       222      Steven    1     d
5       222      Steven    2     e

I have a solution, but I was hoping there is something more concise. Here's the nested list:

l <- list(
  list( 
    person_id   = 111L,
    person_name = "Russell", 
    time        = 1:3, 
    value       = letters[1:3]
  ),
  list( 
    person_id   = 222L,
    person_name = "Steven", 
    time        = 1:2, 
    value       = letters[4:5]
  )
)   

Regarding possible duplicates, this question is similar to (1) "How to denormalize nested list in R?", but the structure is different (the round / diff / saldo structure is transposed compared to time / value here), and to (2) "Split comma-separated column into separate rows", but time is a vector instead of a comma-separated element like director. I'm hoping this different structure helps.

Reduce(rbind,lapply(l,data.frame))

To complement the ideas/approaches by @lmo and @submartingale, here's a purrr/tidyverse version that converts each nested list into a data.frame/tibble (replicating the parent elements, name and id, as needed), then stacks them into a single tibble.

l %>% 
  purrr::map_df(tibble::as_tibble)

Thanks guys for proposing something so concise and generalizable.

A simple base R method is to use lapply and data.frame to return a list of data.frames, and then use do.call with rbind to combine them into a single data.frame object.

do.call(rbind, lapply(l, data.frame))

which returns

  person_id person_name time value
1       111     Russell    1     a
2       111     Russell    2     b
3       111     Russell    3     c
4       222      Steven    1     d
5       222      Steven    2     e
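
To see why this works, it can help to look at the intermediate step. The sketch below rebuilds the question's list `l` so it runs on its own (`dfs` and `result` are illustrative names, not from the original answer). It shows that data.frame() recycles the length-one person_id and person_name to match the longer time and value vectors, so each person becomes a small data.frame before rbind stacks them.

```r
# The nested list from the question, repeated so this block is self-contained.
l <- list(
  list(person_id = 111L, person_name = "Russell", time = 1:3, value = letters[1:3]),
  list(person_id = 222L, person_name = "Steven",  time = 1:2, value = letters[4:5])
)

# One data.frame per person; length-one elements are recycled
# to the length of the longest element (time / value).
dfs <- lapply(l, data.frame)
nrow(dfs[[1]])  # rows for Russell
nrow(dfs[[2]])  # rows for Steven

# Stack the per-person data.frames into a single data.frame.
result <- do.call(rbind, dfs)
dim(result)
```

Recycling only works when the shorter elements divide evenly into the longest one, so this approach assumes the parent fields are scalars and the child vectors within a person share one length.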

Note that in R versions before 4.0.0, person_name and value will be factor vectors, which can be annoying to work with (from R 4.0.0 onward, stringsAsFactors defaults to FALSE, so they are already character vectors). If desired, you can request character vectors explicitly using the stringsAsFactors argument.

do.call(rbind, lapply(l, data.frame, stringsAsFactors=FALSE))

The printed output looks the same, but the underlying data types of these two variables have changed.
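
Since the printed output hides the column types, one way to confirm them is to inspect the class of each column. This is a minimal sketch, with the question's list rebuilt so the block runs on its own (`out` and `col_classes` are illustrative names):

```r
# The nested list from the question, repeated so this block is self-contained.
l <- list(
  list(person_id = 111L, person_name = "Russell", time = 1:3, value = letters[1:3]),
  list(person_id = 222L, person_name = "Steven",  time = 1:2, value = letters[4:5])
)

out <- do.call(rbind, lapply(l, data.frame, stringsAsFactors = FALSE))

# Named character vector giving the class of each column.
col_classes <- sapply(out, class)
col_classes
```

With stringsAsFactors = FALSE, person_name and value should report "character" and the two numeric columns "integer".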

This works, but is less than ideal because (a) each vector in the new data.frame needs to be handled individually and (b) the type of each vector must be stated explicitly (e.g., purrr::map_chr vs purrr::map_int).

# Step 1: Determine how many times the 'parent' rows need to be replicated.
values_per_person <- l %>% 
  purrr::modify_depth(2, length) %>% 
  purrr::map_int("value")

# Step 2: Pull out the parent rows and replicate the elements to match `time`.
id_replicated <- l %>% 
  purrr::map_int("person_id") %>% 
  rep(times=values_per_person)    
name_replicated <- l %>%
  purrr::map_chr("person_name") %>% 
  rep(times=values_per_person)

# Step 3: Pull out the nested/child rows.
time <- l %>%
  purrr::modify_depth(1, "time") %>% 
  purrr::flatten_int()
value <- l %>%
  purrr::modify_depth(1, "value") %>% 
  purrr::flatten_chr()

# Step 4: Combine the vectors in a data frame.
data.frame(
  person_id   = id_replicated,
  person_name = name_replicated,
  time        = time,
  value       = value
)
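
For comparison, the same four steps can be sketched in base R. This is not part of the original solution; `pluck_field` is a hypothetical helper introduced only for illustration, and the list `l` is rebuilt so the block runs on its own. Using unlist() means the column types do not have to be spelled out, though each column is still handled by hand.

```r
# The nested list from the question, repeated so this block is self-contained.
l <- list(
  list(person_id = 111L, person_name = "Russell", time = 1:3, value = letters[1:3]),
  list(person_id = 222L, person_name = "Steven",  time = 1:2, value = letters[4:5])
)

# Hypothetical helper: extract one named field from every person.
pluck_field <- function(x, name) lapply(x, `[[`, name)

# Step 1: Count the child rows per person.
values_per_person <- lengths(pluck_field(l, "value"))

# Step 2: Replicate the parent fields to match the child rows.
id_replicated   <- rep(unlist(pluck_field(l, "person_id")),   times = values_per_person)
name_replicated <- rep(unlist(pluck_field(l, "person_name")), times = values_per_person)

# Step 3: Flatten the nested/child vectors.
time  <- unlist(pluck_field(l, "time"))
value <- unlist(pluck_field(l, "value"))

# Step 4: Combine the vectors in a data frame.
result <- data.frame(
  person_id   = id_replicated,
  person_name = name_replicated,
  time        = time,
  value       = value
)
```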

(Four years later and I'm still using this once or twice a month.) The yaml package provides a map handler. In this case, each map/person is converted into a tibble. Then dplyr::bind_rows() stacks all the tibbles to create a single, longer tibble.

path_yaml |> # Replace this line with code below to see a working example.
  yaml::read_yaml(
    handlers = list(map = \(x) tibble::as_tibble(x))
  ) |> 
  dplyr::bind_rows()

Extra details: with this simple dataset, the handler isn't even required -- bind_rows() converts each piece automatically. But I'm skeptical that it will always know how to coerce each map before stacking. Plus, this explicit handler better communicates the intent.

If you want to play with a reproducible example, replace the file path (i.e., the first line) with

string <- 
"- person_id: 111
  person_name: Russell
  time:
  - 1
  - 2
  - 3
  value:
  - a
  - b
  - c
- person_id: 222
  person_name: Steven
  time:
  - 1
  - 2
  value:
  - d
  - e
"

textConnection(string) |> 
  yaml::read_yaml(...
