
R - generating new columns from a column with a variable number of delimited entries

I've got a table of journal publications, and I'd like to extract the first, second-to-last and last authors of each one.

Unfortunately, the number of authors varies a lot, with some having one and some as many as 35.

If a publication has one author, I expect to get just one first author. I hope to get a first and last author if there are two authors. If there are three authors, I expect a first, second last and last author and so on.

Here's the original dataset:

pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
        "pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
        "author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
        "author1, author2, author3, author4, author5, author6")), 
        class = "data.frame", row.names = c(NA, -6L))

And here's an expected output:

pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
        "pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
        "author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
        "author1, author2, author3, author4, author5, author6"),
        author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
        author_second_last = c("", ""," author2", " author3", " author4", " author5"),
        author_last = c("", " author2", " author3", " author4", " author5", " author6")),
        class = "data.frame", row.names = c(NA, -6L))

I have no idea how to go about this.

Here's one way to do it using dplyr and stringr:

library(dplyr)
library(stringr)

author_position <- function(str, p, position) {
  stopifnot(is.numeric(position))
  # Split the string into pieces on a pattern (in this case `,`) and trim the white space.
  # str_split() returns a list; rowwise() passes one string per call, so take element 1.
  s <- str_trim(str_split(str, p)[[1]])
  len <- length(s)

  # Return NA if the absolute position is greater than or equal to the number of authors.
  # Caveat: if the position is 1, always return the first author.
  if (abs(position) >= len) {
    if (position == 1) {
      first(s)
    } else {
      NA_character_
    }
  # Otherwise return the value at the selected position (negative positions count from the end)
  } else {
    nth(s, position)
  }
}

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors, ",", 1),
         author_second_last = author_position(authors, ",", -2),
         author_last = author_position(authors, ",", -1))

# # A tibble: 6 × 5
# # Rowwise: 
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>      
# 1 pub1        author1                                              author1      NA                 NA         
# 2 pub2        author1, author2                                     author1      NA                 author2    
# 3 pub3        author1, author2, author3                            author1      author2            author3    
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4    
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5    
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6 
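
Note that the result uses NA where the expected output in the question uses empty strings. If the empty-string form is needed, the NA values can be replaced afterwards. The following is a minimal base-R sketch, not part of the original answer; `res` is a small made-up stand-in for the tibble printed above, with values copied from its first three rows.

```r
# Hypothetical stand-in for the first three rows of the result above
res <- data.frame(
  publication = c("pub1", "pub2", "pub3"),
  author_second_last = c(NA, NA, "author2"),
  author_last = c(NA, "author2", "author3"),
  stringsAsFactors = FALSE
)

# Select only the derived columns; "^author_" does not match the original `authors` column
author_cols <- grep("^author_", names(res), value = TRUE)

# Replace NA with "" in each of those columns
res[author_cols] <- lapply(res[author_cols], function(x) ifelse(is.na(x), "", x))
```

Keeping NA is usually the safer default for later filtering or counting, so this step is only worth applying when the output has to match the empty-string layout exactly.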

Edit: the function now accepts any author position, and I added comments.

The only constraint is that the first and last authors are fixed: a position that lands on the opposite end of the author list returns NA. For example, asking for the 3rd-to-last author of a publication with only 3 authors returns NA, since technically that author is the first author. The same goes for asking for the 3rd author of a 3-author publication, since that author is the last author.

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors, ",", 3),
         author_third_last = author_position(authors, ",", -3))


# # A tibble: 6 × 4
# # Rowwise: 
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>            
# 1 pub1        author1                                              NA           NA               
# 2 pub2        author1, author2                                     NA           NA               
# 3 pub3        author1, author2, author3                            NA           NA               
# 4 pub4        author1, author2, author3, author4                   author3      author2          
# 5 pub5        author1, author2, author3, author4, author5          author3      author3          
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4  
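
For larger tables, the same rules can be applied to the whole column at once instead of row by row. The sketch below is an assumption-laden alternative in base R, not the method from the answer above: `author_at` is a hypothetical helper name, it treats the separator as a fixed string rather than a regular expression, and `pub1` is rebuilt here with `paste0()` so the block runs on its own.

```r
# Hypothetical vectorised helper (base R only); mirrors the NA rules described above
author_at <- function(authors, position, sep = ",") {
  stopifnot(is.numeric(position), length(position) == 1, position != 0)
  # Split every string on the separator and trim surrounding whitespace
  pieces <- lapply(strsplit(authors, sep, fixed = TRUE), trimws)
  vapply(pieces, function(s) {
    n <- length(s)
    # Position 1 always returns the first author when there is one
    if (position == 1 && n >= 1) return(s[1])
    # A position that reaches or passes the other end of the list gives NA
    if (abs(position) >= n) return(NA_character_)
    # Negative positions count from the end
    if (position > 0) s[position] else s[n + position + 1]
  }, character(1))
}

# Rebuild the example data: pub1 has 1 author, pub6 has 6
pub1 <- data.frame(
  publication = paste0("pub", 1:6),
  authors = vapply(1:6, function(n) paste0("author", seq_len(n), collapse = ", "),
                   character(1)),
  stringsAsFactors = FALSE
)

pub1$author_first <- author_at(pub1$authors, 1)
pub1$author_second_last <- author_at(pub1$authors, -2)
pub1$author_last <- author_at(pub1$authors, -1)
```

Because the split happens once per call on the full column, no `rowwise()` grouping is needed, and the result is a plain data frame.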


 