[英]Gather() list columns to rows in R
I want to gather() list columns to create new rows in my data frame. 我想收集()列表列以在我的数据框中创建新行。 I'm using the Game of Thrones data set in the repurrrsive package.
我正在使用repurrrsive包中的权力的游戏数据集。 Below is my code to set up the problem:
以下是我设置问题的代码:
library(tidyverse)
got_chars <- repurrrsive::got_chars
df <- got_chars %>%
{
tibble::tibble(
Name = map_chr(., 'name'),
Gender = map_chr(.,'gender'),
Culture = map_chr(.,'culture'),
Born = map_chr(.,'born'),
Alive = map_chr(.,'alive'),
Titles = map(.,'titles'),
Aliases = map(., "aliases"),
Allegiances = map(., "allegiances"),
Books = map(.,'books'),
POV_Books = map(.,'povBooks'),
TV_Series = map(.,'tvSeries'),
Actor = map(.,'playedBy')
)
}
What I would like to be able to do, but can't figure out is to gather()
the list columns (eg Books, POV_Books, etc.) in order to create a new row for each record. 我希望能够做到的,但无法弄清楚是
gather()
列表列(例如Books,POV_Books等),以便为每条记录创建一个新行。 For example: 例如:
Name | 名称| Book
书
Theon Greyjoy | Theon Greyjoy | A Game of Thrones
权力的游戏
Theon Greyjoy | Theon Greyjoy | A Storm of Swords
剑风暴
Theon Greyjoy | Theon Greyjoy | A Feast for Crows
乌鸦的盛宴
The closest I've been able to get is: 我能得到的最接近的是:
df_books <- df %>%
separate_rows(Books,sep="\"")
This will work, but leaves a trail of garbage behind from the c() characters within the vectors. 这样可以工作,但是从向量中的c()字符留下一堆垃圾。 I can filter those out, but I feel like there's a much better way and I might just not be trying the correct functions.
我可以过滤掉那些,但我觉得有一个更好的方法,我可能只是没有尝试正确的功能。 Any suggestions would be much appreciated, thanks!
任何建议将不胜感激,谢谢!
Your tibble currently looks like this: 你的目标看起来像这样:
df
# # A tibble: 30 x 12
# Name Gender Culture Born Alive Titles Aliases Allegiances Books POV_Books TV_Series Actor
# <chr> <chr> <chr> <chr> <chr> <list> <list> <list> <lis> <list> <list> <lis>
# 1 Theon Greyjoy Male Ironbo… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [2]> <chr [6]> <chr…
# 2 Tyrion Lannister Male "" In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [4]> <chr [6]> <chr…
# 3 Victarion Greyjoy Male Ironbo… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [2]> <chr [1]> <chr…
# 4 Will Male "" "" FALSE <chr … <chr [… <NULL> <chr… <chr [1]> <chr [1]> <chr…
# 5 Areo Hotah Male Norvos… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [2]> <chr [2]> <chr…
# 6 Chett Male "" At H… FALSE <chr … <chr [… <NULL> <chr… <chr [1]> <chr [1]> <chr…
# 7 Cressen Male "" In 2… FALSE <chr … <chr [… <NULL> <chr… <chr [1]> <chr [1]> <chr…
# 8 Arianne Martell Female Dornish In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [1]> <chr [1]> <chr…
# 9 Daenerys Targaryen Female Valyri… In 2… TRUE <chr … <chr [… <chr [1]> <chr… <chr [4]> <chr [6]> <chr…
# 10 Davos Seaworth Male Wester… In 2… TRUE <chr … <chr [… <chr [2]> <chr… <chr [3]> <chr [5]> <chr…
# # ... with 20 more rows
unnest()
would be an obvious choice, but doesn't work if all of the lists are the same in terms of how many values they would expand to. unnest()
将是一个明显的选择,但如果所有列表在扩展到多少值方面相同,则不起作用。
library(tidyverse)
unnest(df)
# Error: All nested columns must have the same number of elements.
One approach would be to use the following functions. 一种方法是使用以下功能。
flatten()
makes the data "wide", and flattenLong()
takes the "wide" data and makes it "long". flatten()
使数据“宽”, flattenLong()
获取“宽”数据并使其“长”。 The assumption that has been made about missing data is that if a vector in a list item is shorter than the matching vector in another list item, the missing data are last. 关于缺失数据的假设是,如果列表项中的向量比另一个列表项中的匹配向量短,则缺失数据是最后的。
flatten <- function(indt, cols, drop = FALSE) {
require(data.table)
if (!is.data.table(indt)) indt <- as.data.table(indt)
x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
nams <- paste(rep(cols, x), sequence(x), sep = "_")
indt[, (nams) := unlist(lapply(.SD, data.table::transpose), recursive = FALSE), .SDcols = (cols)]
if (isTRUE(drop)) indt[, (cols) := NULL]
indt[]
}
flattenLong <- function(indt, cols) {
ob <- setdiff(names(indt), cols)
x <- flatten(indt, cols, TRUE)
mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
Here's one way to use it, applying it to all list
columns. 这是使用它的一种方法,将其应用于所有
list
列。
flattenLong(df, names(df)[sapply(df, is.list)])
# Name Gender Culture Born Alive variable
# 1: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 1
# 2: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 2
# 3: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 3
# 4: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 4
# 5: Aeron Greyjoy Male Ironborn In or between 269 AC and 273 AC, at Pyke TRUE 5
# ---
# 476: Will Male FALSE 12
# 477: Will Male FALSE 13
# 478: Will Male FALSE 14
# 479: Will Male FALSE 15
# 480: Will Male FALSE 16
# Titles Aliases Allegiances Books
# 1: Priest of the Drowned God The Damphair House Greyjoy of Pyke A Game of Thrones
# 2: Captain of the Golden Storm (formerly) Aeron Damphair NA A Clash of Kings
# 3: NA NA NA A Storm of Swords
# 4: NA NA NA A Dance with Dragons
# 5: NA NA NA NA
# ---
# 476: NA NA NA NA
# 477: NA NA NA NA
# 478: NA NA NA NA
# 479: NA NA NA NA
# 480: NA NA NA NA
# POV_Books TV_Series Actor
# 1: A Feast for Crows Season 6 Michael Feast
# 2: NA NA NA
# 3: NA NA NA
# 4: NA NA NA
# 5: NA NA NA
# ---
# 476: NA NA NA
# 477: NA NA NA
# 478: NA NA NA
# 479: NA NA NA
# 480: NA NA NA
You could also do something like any of the following for dealing with a single column: 您还可以执行以下任何操作来处理单个列:
flattenLong(df[c(names(df)[!sapply(df, is.list)], "Books")], "Books")
flattenLong(df[c("Name", "Gender", "Culture", "Born", "Alive", "Books")], "Books")
df %>%
select(Name, Gender, Culture, Born, Alive, Books) %>%
flattenLong("Books")
This isn't at all meant to be identical to the "tidyverse" approach. 这完全不等同于“tidyverse”方法。 It handles
NULL
differently, and unnest
s each group to be the same length. 它以不同的方式处理
NULL
,并且每个组的长度unnest
相同。 Consider the following dataset: 请考虑以下数据集:
mydf <- data.frame(V1 = c("a", "b", "c"),
V2 = I(list(c(10, 20), NA_real_, c(20, 40, 60))),
V3 = I(list(NULL, c("x", "y", "z"), c("BA", "BB"))))
mydf
# V1 V2 V3
# 1 a 10, 20
# 2 b NA x, y, z
# 3 c 20, 40, 60 BA, BB
Difference #1: Number of values per group : 差异#1:每组的值数 :
# Note the resulting number of values per group
# Equivalent of
# as.data.table(mydf)[, list(unlist(V2)), V1]
mydf %>% select(V1, V2) %>% unnest()
# V1 V2
# 1 a 10
# 2 a 20
# 3 b NA
# 4 c 20
# 5 c 40
# 6 c 60
flattenLong(mydf[c("V1", "V2")], "V2")
# V1 variable V2
# 1: a V2_1 10
# 2: a V2_2 20
# 3: a V2_3 NA
# 4: b V2_1 NA
# 5: b V2_2 NA
# 6: b V2_3 NA
# 7: c V2_1 20
# 8: c V2_2 40
# 9: c V2_3 60
Difference #2: Handling NULL
values 差异#2:处理
NULL
值
mydf %>% select(V1, V3) %>% unnest()
# Error: Each column must either be a list of vectors or a list of data frames [V3]
flattenLong(mydf[c("V1", "V2")], "V2")
# V1 variable V3
# 1: a V3_1 NA
# 2: a V3_2 NA
# 3: a V3_3 NA
# 4: b V3_1 x
# 5: b V3_2 y
# 6: b V3_3 z
# 7: c V3_1 BA
# 8: c V3_2 BB
# 9: c V3_3 NA
You can use unnest
, but first you have to format the column in a way tidyr
understands. 您可以使用
unnest
,但首先您必须以tidyr
理解的方式格式化列。
This means: 这意味着:
NULL
elements in the columns you want to unnest NULL
元素 data.frames
and not vectors
data.frames
而不是vectors
. 。
library(tidyverse)
df %>%
select(Name,Books) %>% # skip this line to keep all columns
slice(which(lengths(Books)>0)) %>%
mutate(Books = map(Books,~tibble(Book=.x))) %>%
unnest(Books)
# # A tibble: 77 x 2
# Name Book
# <chr> <chr>
# 1 Theon Greyjoy A Game of Thrones
# 2 Theon Greyjoy A Storm of Swords
# 3 Theon Greyjoy A Feast for Crows
# 4 Tyrion Lannister A Feast for Crows
# 5 Tyrion Lannister The World of Ice and Fire
# 6 Victarion Greyjoy A Game of Thrones
# 7 Victarion Greyjoy A Clash of Kings
# 8 Victarion Greyjoy A Storm of Swords
# 9 Will A Clash of Kings
# 10 Areo Hotah A Game of Thrones
# # ... with 67 more rows
Your tried solution works fine though if we filter the output (same output as my solution): 如果我们过滤输出(与我的解决方案相同的输出),你尝试过的解决方案工作正常:
df %>%
select(Name, Books) %>%
separate_rows(Books,sep="\"") %>%
filter(!Books %in% c("c(",", ",")") & lengths(Books)>0)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.