Gather() list columns to rows in R

I want to gather() list columns to create new rows in my data frame. I'm using the Game of Thrones data set in the repurrrsive package. Below is my code to set up the problem:

library(tidyverse)
got_chars <- repurrrsive::got_chars
df <- got_chars %>%
  {
    tibble::tibble(
      Name = map_chr(., "name"),
      Gender = map_chr(., "gender"),
      Culture = map_chr(., "culture"),
      Born = map_chr(., "born"),
      Alive = map_chr(., "alive"),  # alive is logical; stored as character here
      Titles = map(., "titles"),
      Aliases = map(., "aliases"),
      Allegiances = map(., "allegiances"),
      Books = map(., "books"),
      POV_Books = map(., "povBooks"),
      TV_Series = map(., "tvSeries"),
      Actor = map(., "playedBy")
    )
  }

What I would like to do, but can't figure out, is gather() the list columns (e.g. Books, POV_Books, etc.) so that each element of a list gets its own row. For example:

Name          | Book
Theon Greyjoy | A Game of Thrones
Theon Greyjoy | A Storm of Swords
Theon Greyjoy | A Feast for Crows

The closest I've been able to get is:

df_books <- df %>%
  separate_rows(Books,sep="\"")

This works, but it leaves a trail of garbage behind from the c() characters within the vectors. I can filter those out, but I feel like there's a much better way and I might just not be trying the correct functions. Any suggestions would be much appreciated, thanks!

Your tibble currently looks like this:

df
# # A tibble: 30 x 12
#    Name               Gender Culture Born  Alive Titles Aliases Allegiances Books POV_Books TV_Series Actor
#    <chr>              <chr>  <chr>   <chr> <chr> <list> <list>  <list>      <lis> <list>    <list>    <lis>
#  1 Theon Greyjoy      Male   Ironbo… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [2]> <chr [6]> <chr…
#  2 Tyrion Lannister   Male   ""      In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [4]> <chr [6]> <chr…
#  3 Victarion Greyjoy  Male   Ironbo… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [2]> <chr [1]> <chr…
#  4 Will               Male   ""      ""    FALSE <chr … <chr [… <NULL>      <chr… <chr [1]> <chr [1]> <chr…
#  5 Areo Hotah         Male   Norvos… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [2]> <chr [2]> <chr…
#  6 Chett              Male   ""      At H… FALSE <chr … <chr [… <NULL>      <chr… <chr [1]> <chr [1]> <chr…
#  7 Cressen            Male   ""      In 2… FALSE <chr … <chr [… <NULL>      <chr… <chr [1]> <chr [1]> <chr…
#  8 Arianne Martell    Female Dornish In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [1]> <chr [1]> <chr…
#  9 Daenerys Targaryen Female Valyri… In 2… TRUE  <chr … <chr [… <chr [1]>   <chr… <chr [4]> <chr [6]> <chr…
# 10 Davos Seaworth     Male   Wester… In 2… TRUE  <chr … <chr [… <chr [2]>   <chr… <chr [3]> <chr [5]> <chr…
# # ... with 20 more rows

unnest() would be an obvious choice, but it only works if, in each row, all of the list columns expand to the same number of values. That is not the case here.

library(tidyverse)
unnest(df)
# Error: All nested columns must have the same number of elements.

One approach would be to use the following functions. flatten() makes the data "wide", and flattenLong() takes the "wide" data and makes it "long". The assumption made about missing data is that if a vector in a list item is shorter than the matching vector in another list item, the missing values come last.

flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  indt[, (nams) := unlist(lapply(.SD, data.table::transpose), recursive = FALSE), .SDcols = (cols)]
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}

flattenLong <- function(indt, cols) {
  ob <- setdiff(names(indt), cols)
  x <- flatten(indt, cols, TRUE)
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
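The "missing values come last" assumption can be seen in isolation with data.table::transpose(), which flatten() calls on each list column. This is only a small sketch on a made-up list (the name cols_wide is hypothetical), not part of the functions above:

```r
library(data.table)

# A made-up list column with vectors of unequal length
x <- list(c(10, 20), NA_real_, c(20, 40, 60))

# transpose() turns the i-th elements of every vector into the i-th output
# vector, padding shorter inputs with NA at the end (fill = NA is the default)
cols_wide <- data.table::transpose(x)

# One output vector per position, each with one value per original row
str(cols_wide)
```

Each element of cols_wide becomes one of the "wide" columns (V2_1, V2_2, ...) that flatten() creates.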

Here's one way to use flattenLong(), applying it to all list columns.

flattenLong(df, names(df)[sapply(df, is.list)])
#               Name Gender  Culture                                     Born Alive variable
#   1: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        1
#   2: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        2
#   3: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        3
#   4: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        4
#   5: Aeron Greyjoy   Male Ironborn In or between 269 AC and 273 AC, at Pyke  TRUE        5
# ---                                                                                      
# 476:          Will   Male                                                   FALSE       12
# 477:          Will   Male                                                   FALSE       13
# 478:          Will   Male                                                   FALSE       14
# 479:          Will   Male                                                   FALSE       15
# 480:          Will   Male                                                   FALSE       16
#                                      Titles        Aliases           Allegiances                Books
#   1:              Priest of the Drowned God   The Damphair House Greyjoy of Pyke    A Game of Thrones
#   2: Captain of the Golden Storm (formerly) Aeron Damphair                    NA     A Clash of Kings
#   3:                                     NA             NA                    NA    A Storm of Swords
#   4:                                     NA             NA                    NA A Dance with Dragons
#   5:                                     NA             NA                    NA                   NA
# ---                                                                                                 
# 476:                                     NA             NA                    NA                   NA
# 477:                                     NA             NA                    NA                   NA
# 478:                                     NA             NA                    NA                   NA
# 479:                                     NA             NA                    NA                   NA
# 480:                                     NA             NA                    NA                   NA
#              POV_Books TV_Series         Actor
#   1: A Feast for Crows  Season 6 Michael Feast
#   2:                NA        NA            NA
#   3:                NA        NA            NA
#   4:                NA        NA            NA
#   5:                NA        NA            NA
# ---                                          
# 476:                NA        NA            NA
# 477:                NA        NA            NA
# 478:                NA        NA            NA
# 479:                NA        NA            NA
# 480:                NA        NA            NA

You could also do something like any of the following to deal with a single column:

flattenLong(df[c(names(df)[!sapply(df, is.list)], "Books")], "Books")

flattenLong(df[c("Name", "Gender", "Culture", "Born", "Alive", "Books")], "Books")

df %>% 
  select(Name, Gender, Culture, Born, Alive, Books) %>%
  flattenLong("Books")
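If only one list column is needed and no padding is wanted, a plain base R idiom may be enough. The sketch below uses a made-up data frame (toy and long are hypothetical names) rather than df, and it assumes that rows whose list entry is NULL should simply disappear:

```r
# Made-up stand-in for df with one ordinary column and one list column
toy <- data.frame(Name = c("a", "b", "c"), stringsAsFactors = FALSE)
toy$Books <- list(c("Book 1", "Book 2"), NULL, "Book 3")

# Repeat each Name once per element of its list entry, then flatten the list.
# lengths() is 0 for the NULL entry, so row "b" contributes no rows.
long <- data.frame(
  Name = rep(toy$Name, lengths(toy$Books)),
  Book = unlist(toy$Books),
  stringsAsFactors = FALSE
)
long
```

Unlike flattenLong(), this keeps only the values that exist and does not pad groups to a common length.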

This isn't at all meant to be identical to the "tidyverse" approach. It handles NULL differently, and it expands every group to the same length. Consider the following dataset:

mydf <- data.frame(V1 = c("a", "b", "c"), 
                   V2 = I(list(c(10, 20), NA_real_, c(20, 40, 60))), 
                   V3 = I(list(NULL, c("x", "y", "z"), c("BA", "BB"))))
mydf                   
#   V1         V2      V3
# 1  a     10, 20        
# 2  b         NA x, y, z
# 3  c 20, 40, 60  BA, BB

Difference #1: Number of values per group

# Note the resulting number of values per group
# Equivalent of
# as.data.table(mydf)[, list(unlist(V2)), V1]
mydf %>% select(V1, V2) %>% unnest()
#   V1 V2
# 1  a 10
# 2  a 20
# 3  b NA
# 4  c 20
# 5  c 40
# 6  c 60

flattenLong(mydf[c("V1", "V2")], "V2")
#    V1 variable V2
# 1:  a     V2_1 10
# 2:  a     V2_2 20
# 3:  a     V2_3 NA
# 4:  b     V2_1 NA
# 5:  b     V2_2 NA
# 6:  b     V2_3 NA
# 7:  c     V2_1 20
# 8:  c     V2_2 40
# 9:  c     V2_3 60

Difference #2: Handling NULL values

mydf %>% select(V1, V3) %>% unnest()
# Error: Each column must either be a list of vectors or a list of data frames [V3]

flattenLong(mydf[c("V1", "V3")], "V3")
#    V1 variable V3
# 1:  a     V3_1 NA
# 2:  a     V3_2 NA
# 3:  a     V3_3 NA
# 4:  b     V3_1  x
# 5:  b     V3_2  y
# 6:  b     V3_3  z
# 7:  c     V3_1 BA
# 8:  c     V3_2 BB
# 9:  c     V3_3 NA

You can use unnest(), but first you have to format the column in a way tidyr understands.

This means:

  • no NULL elements in the columns you want to unnest
  • each item stored as a one-column data frame rather than a vector

library(tidyverse)
df %>%
  select(Name,Books) %>% # skip this line to keep all columns
  slice(which(lengths(Books)>0)) %>%
  mutate(Books = map(Books,~tibble(Book=.x))) %>%
  unnest(Books)

# # A tibble: 77 x 2
#                Name                      Book
#               <chr>                     <chr>
# 1     Theon Greyjoy         A Game of Thrones
# 2     Theon Greyjoy         A Storm of Swords
# 3     Theon Greyjoy         A Feast for Crows
# 4  Tyrion Lannister         A Feast for Crows
# 5  Tyrion Lannister The World of Ice and Fire
# 6 Victarion Greyjoy         A Game of Thrones
# 7 Victarion Greyjoy          A Clash of Kings
# 8 Victarion Greyjoy         A Storm of Swords
# 9              Will          A Clash of Kings
# 10        Areo Hotah         A Game of Thrones
# # ... with 67 more rows
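As a side note, and assuming tidyr 1.0.0 or later is installed, unnest() accepts a list column of plain vectors directly, so the reshaping step above may not be needed there. A minimal sketch on a made-up tibble (toy, books_long and books_all are hypothetical names):

```r
library(tibble)
library(tidyr)

# Made-up stand-in for df: one list column containing a NULL entry
toy <- tibble(
  Name  = c("a", "b", "c"),
  Books = list(c("Book 1", "Book 2"), NULL, "Book 3")
)

# Rows whose list entry is NULL are dropped by default
books_long <- unnest(toy, cols = Books)

# keep_empty = TRUE keeps such rows and fills the value with NA
books_all <- unnest(toy, cols = Books, keep_empty = TRUE)
```

Here books_long has one row per existing book, while books_all also keeps a row for "b" with a missing Books value.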

The solution you tried also works fine if we filter the output (it gives the same output as my solution):

df %>%
  select(Name, Books) %>%
  separate_rows(Books,sep="\"") %>%
  filter(!Books %in% c("c(",", ",")") & lengths(Books)>0)
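The three strings filtered out above are what remains when a character vector stored in a list is turned into a single string and then split on the quote character. A rough base R sketch of that effect, using a made-up vector and strsplit() as a stand-in for what separate_rows() does:

```r
# A list holding one character vector, like a single cell of the Books column
cell <- list(c("A Game of Thrones", "A Storm of Swords"))

# Converting the list to character deparses the vector into one string,
# including the surrounding c( ... ) call and the quotes
as_text <- as.character(cell)

# Splitting on the double quote leaves the book titles plus the debris
# "c(", ", " and ")"
pieces <- strsplit(as_text, "\"", fixed = TRUE)[[1]]
pieces
```

This is why the filter has to remove exactly "c(", ", " and ")".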
