R - Generate new columns from a column with a variable number of delimited entries

Question

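I have a table of journal publications and I want to extract the first, second-to-last, and last author.

Unfortunately, the number of authors varies a lot: some publications have one author, others as many as 35.

If a publication has one author, I want only a first author. If it has two authors, I want a first and a last author. If it has three authors, I want a first, a second-to-last, and a last author, and so on.

This is the original dataset: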

pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
        "pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
        "author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
        "author1, author2, author3, author4, author5, author6")), 
        class = "data.frame", row.names = c(NA, -6L))

This is the expected output:

pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
        "pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
        "author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
        "author1, author2, author3, author4, author5, author6"),
        author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
        author_second_last = c("", ""," author2", " author3", " author4", " author5"),
        author_last = c("", " author2", " author3", " author4", " author5", " author6")),
        class = "data.frame", row.names = c(NA, -6L))

I don't know how to go about this.

Answer 1

Here is one idea for how to do this with dplyr and stringr:

library(dplyr)
library(stringr)

author_position = function(str, p, position) {
  stopifnot(is.numeric(position))
  # split the string up into a vector of pieces using a pattern (in this case `,`)
  # and trim the white space
  s = str_trim(str_split(str, p, simplify = TRUE))
  len = length(s)
  
  # Return NA if the author position chosen is greater than or equal to the length of the new vector
  # Caveat: If the position is 1, then return the value at the first position
  if(abs(position) >= len) {
    if(position == 1) {
      first(s)
    } else {
      NA
    }
  # Return the value at the selected position
  } else {
    nth(s, position)
  }
}

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors,",",1),
         author_second_last = author_position(authors,",",-2),
         author_last = author_position(authors,",",-1))

# # A tibble: 6 × 5
# # Rowwise: 
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>      
# 1 pub1        author1                                              author1      NA                 NA         
# 2 pub2        author1, author2                                     author1      NA                 author2    
# 3 pub3        author1, author2, author3                            author1      author2            author3    
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4    
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5    
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6
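
For comparison, the same three columns can be sketched in base R without `rowwise()`. This is only an illustration under stated assumptions: the helper names `from_end` and `add_author_columns` are hypothetical and not part of the answer above, and authors are assumed to be separated by a comma with optional whitespace.

```r
# Hypothetical helper (not from the original answer): return the element that
# sits `offset` places before the end, or NA when a row has fewer than `min_n`
# authors.
from_end <- function(parts, offset, min_n) {
  vapply(parts, function(p) {
    if (length(p) >= min_n) p[length(p) - offset] else NA_character_
  }, character(1))
}

# Hypothetical wrapper: adds the three author columns to a data frame.
add_author_columns <- function(df, col = "authors") {
  # One character vector of author names per row
  parts <- strsplit(df[[col]], ",\\s*")
  df$author_first       <- vapply(parts, function(p) p[1], character(1))
  # Second-to-last exists only when there are at least three authors
  df$author_second_last <- from_end(parts, offset = 1, min_n = 3)
  # Last exists only when there are at least two authors
  df$author_last        <- from_end(parts, offset = 0, min_n = 2)
  df
}

demo <- data.frame(
  publication = c("pub1", "pub2", "pub3", "pub4"),
  authors = c("author1", "author1, author2", "author1, author2, author3",
              "author1, author2, author3, author4")
)
result <- add_author_columns(demo)
print(result)
```

Like the dplyr version, this sketch returns `NA` rather than an empty string where a position does not apply.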

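Edit: added the ability to return any author position, and added comments.

The only limitation here is that the first and last author are fixed. So if you ask for the third-to-last author and the publication has only 3 authors, it returns NA, because technically that author counts as the first author. The same applies to asking for the 3rd author: with only 3 authors, that one counts as the last author.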

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors,",",3),
         author_third_last = author_position(authors, ",", -3))


# # A tibble: 6 × 4
# # Rowwise: 
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>            
# 1 pub1        author1                                              NA           NA               
# 2 pub2        author1, author2                                     NA           NA               
# 3 pub3        author1, author2, author3                            NA           NA               
# 4 pub4        author1, author2, author3, author4                   author3      author2          
# 5 pub5        author1, author2, author3, author4, author5          author3      author3          
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4
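
To look at the fixed first/last rule in isolation, here is a minimal base R sketch that restates the same condition for a single string. The name `position_or_na` is hypothetical. It is a re-expression for illustration, not the answer's own function, and it assumes a plain comma separator.

```r
# Hypothetical base R restatement of the rule used in author_position().
position_or_na <- function(authors, position) {
  # Split on commas and trim surrounding whitespace
  s <- trimws(strsplit(authors, ",", fixed = TRUE)[[1]])
  len <- length(s)
  if (abs(position) >= len) {
    # A position that reaches the first or last slot is reserved:
    # only position 1 is honoured, everything else becomes NA
    if (position == 1) s[1] else NA_character_
  } else if (position > 0) {
    s[position]
  } else {
    s[len + position + 1]  # negative positions count from the end
  }
}

three <- "author1, author2, author3"
r1 <- position_or_na(three, 3)    # NA: the 3rd of 3 authors is the last author
r2 <- position_or_na(three, -3)   # NA: the 3rd-to-last of 3 is the first author
r3 <- position_or_na(three, 2)    # "author2"
r4 <- position_or_na("author1, author2, author3, author4", -3)  # "author2"
```

The last call matches the `pub4` row in the output above, where four authors leave room for a third-to-last author.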

Solution 1
3, accepted, 2022-11-16 17:45:55
