R - generating new columns from a column with a variable number of delimited entries
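I have a table of journal publications, and I want to extract the first, second-to-last, and last authors.
Unfortunately, the number of authors varies widely: some publications have only one, others have as many as 35.
If a publication has one author, I want only a first author. If it has two authors, I want a first and a last author. If it has three authors, I want a first, a second-to-last, and a last author, and so on.
Here is the original dataset: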
pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4",
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3",
"author1, author2, author3, author4", "author1, author2, author3, author4, author5",
"author1, author2, author3, author4, author5, author6")),
class = "data.frame", row.names = c(NA, -6L))
Here is the expected output:
pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4",
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3",
"author1, author2, author3, author4", "author1, author2, author3, author4, author5",
"author1, author2, author3, author4, author5, author6"),
author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
author_second_last = c("", ""," author2", " author3", " author4", " author5"),
author_last = c("", " author2", " author3", " author4", " author5", " author6")),
class = "data.frame", row.names = c(NA, -6L))
I don't know how to go about this.
Here is one idea for doing this with dplyr and stringr:
library(dplyr)
library(stringr)
author_position = function(str, p, position) {
  stopifnot(is.numeric(position))
  # Split the string into a vector of pieces using a pattern (in this case `,`)
  # and trim the white space
  s = str_trim(str_split(str, p, simplify = TRUE))
  len = length(s)
  # Return NA if the absolute value of the chosen position is greater than or
  # equal to the number of authors
  # Caveat: if the position is 1, return the first author
  if (abs(position) >= len) {
    if (position == 1) {
      first(s)
    } else {
      NA
    }
  } else {
    # Otherwise return the value at the selected position
    nth(s, position)
  }
}
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors, ",", 1),
         author_second_last = author_position(authors, ",", -2),
         author_last = author_position(authors, ",", -1))
# # A tibble: 6 × 5
# # Rowwise:
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>
# 1 pub1        author1                                              author1      NA                 NA
# 2 pub2        author1, author2                                     author1      NA                 author2
# 3 pub3        author1, author2, author3                            author1      author2            author3
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6
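For comparison, the same three columns can be sketched in base R without `rowwise()`. This is only a minimal sketch, not part of the answer above: `from_end()` is a hypothetical helper introduced here, the data frame is rebuilt from the question's data, and the code assumes authors are separated by commas. It follows the same rule as above, so a position counted from the end yields NA when it would coincide with the first author.

```r
# Same data as in the question, rebuilt so the example is self-contained
pub1 <- data.frame(
  publication = paste0("pub", 1:6),
  authors = c("author1",
              "author1, author2",
              "author1, author2, author3",
              "author1, author2, author3, author4",
              "author1, author2, author3, author4, author5",
              "author1, author2, author3, author4, author5, author6"),
  stringsAsFactors = FALSE
)

# Split every row once; trimws() drops the space that follows each comma
parts <- lapply(strsplit(pub1$authors, ",", fixed = TRUE), trimws)

# Hypothetical helper: k-th element counted from the end (1 = last),
# or NA when the publication has k or fewer authors
from_end <- function(x, k) {
  n <- length(x)
  if (n > k) x[n - k + 1] else NA_character_
}

pub1$author_first       <- vapply(parts, function(x) x[1], character(1))
pub1$author_second_last <- vapply(parts, from_end, character(1), k = 2)
pub1$author_last        <- vapply(parts, from_end, character(1), k = 1)

pub1[, c("publication", "author_first", "author_second_last", "author_last")]
```

Splitting once and reusing `parts` avoids re-parsing the same string for every new column, which may matter on larger tables.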
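Edit: added the ability to return the author at any position, and added comments.
The only limitation is that the first and last authors are fixed. So if you ask for the third-to-last author and the publication has only 3 authors, it returns NA, because that author is technically the first author. The same applies when asking for the 3rd author: with only 3 authors, that one counts as the last author.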
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors, ",", 3),
         author_third_last = author_position(authors, ",", -3))
# # A tibble: 6 × 4
# # Rowwise:
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>
# 1 pub1        author1                                              NA           NA
# 2 pub2        author1, author2                                     NA           NA
# 3 pub3        author1, author2, author3                            NA           NA
# 4 pub4        author1, author2, author3, author4                   author3      author2
# 5 pub5        author1, author2, author3, author4, author5          author3      author3
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4
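If the fixed first/last restriction is not wanted, one possible variation is to return NA only when the requested position falls outside the author list. The sketch below is an assumption about how that could look, not the answer's method: `author_at()` is a hypothetical name, it uses base R only, and it treats the separator as a fixed string rather than a regular expression.

```r
# Hypothetical variant: positive positions count from the start,
# negative positions count from the end, and NA is returned only
# when the position lies outside the list of authors
author_at <- function(str, sep, position) {
  stopifnot(is.numeric(position), length(position) == 1, position != 0)
  s <- trimws(strsplit(str, sep, fixed = TRUE)[[1]])
  n <- length(s)
  # Convert a negative position into an index from the start
  idx <- if (position > 0) position else n + position + 1
  if (idx < 1 || idx > n) NA_character_ else s[idx]
}

author_at("author1, author2, author3", ",", 3)   # "author3"
author_at("author1, author2, author3", ",", -3)  # "author1"
author_at("author1, author2, author3", ",", 4)   # NA
```

With this variant a three-author publication reports `author3` as both the third and the last author, which is exactly the overlap that the original function avoids.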