Subset a comma-delimited string from a list

This seems like a straightforward operation, but I'm stuck and am looking for pointers.

I have a data frame of authors and their associated publications. In the author column, a single article often has multiple authors in a semicolon-delimited list. Here's a small subset:

structure(list(author = c("Moscatelli, Adriana; Nishina, Adrienne", 
"Asangba, Abigail", "Stewart, Abigail", "Redmond-Sanogo, Adrienne; Lee, Ahlam", 
"Purnamasari, Agustina; Lee, Ahlam; Moscatelli, Adriana", 
"Nishina, Adrienne", "Lee, Ahlam", 
"Lee, Ahlam; Cloutier, Aimee", "Kleihauer, Jay; Stephens, Roy; Hart, William", 
"Foor, Ryan M.; Cano, Jamie"), pubtitle = c("AIP Conference Proceedings", 
"Journal of Case Studies in Accreditation and Assessment", "173rd Meeting of Acoustical Society of America", 
"Journal of Research in Gender Studies", "Journal of Research in Gender Studies", 
"Scientometrics", "Journal of Agricultural Education", "Journal of Agricultural Education", 
"Journal of Agricultural Education", "Journal of Agricultural Education"
)), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L))

I have a second data frame that has just the author names. Here's a subset of those names, for reproducibility:

structure(list(author = c("Asangba, Abigail", "Stewart, Abigail", 
"Moscatelli, Adriana", "Nishina, Adrienne", "Redmond-Sanogo, Adrienne", 
"Purnamasari, Agustina", "Lee, Ahlam", "Aliyeva, Aida", "Belanger, Aimee", 
"Cloutier, Aimee")), row.names = c(NA, 10L), class = "data.frame")

I'm trying to use this second data frame to subset rows from the original data frame, and I'm running into a challenge with the semicolon-delimited names.

I thought this would get me there, but no luck so far. I've tried to change the delimited string into a vector and then match against the list of authors, but it only returns rows where a name appears on its own; names inside a delimited string never match.

list_authors_female <- data %>% 
  select(author, pubtitle) %>% 
  filter(author %in% female_authors_all)

Here, I tried to split the author column into a vector, but I'm hitting an error.

list_authors_female <- data %>%  
  rowwise() %>% 
  mutate(author_list = str_split(author, pattern = ";")) %>% 
  filter(author_list %in% female_authors_all)

Any pointers? Thanks!

Create a regular expression pat of the form author1|author2|...|authorN and apply it to pubs. Here authors is the data frame of names and pubs is the publications data frame. With this approach no splitting is needed.

pat <- authors %>% 
  rowwise %>% 
  mutate(author = toString(author)) %>%
  ungroup %>%
  { paste(.$author, collapse = "|") }

pubs %>% filter(grepl(pat, author))

giving:

# A tibble: 8 x 2
  author                                 pubtitle                               
  <chr>                                  <chr>                                  
1 Moscatelli, Adriana; Nishina, Adrienne AIP Conference Proceedings             
2 Asangba, Abigail                       Journal of Case Studies in Accreditati~
3 Stewart, Abigail                       173rd Meeting of Acoustical Society of~
4 Redmond-Sanogo, Adrienne; Lee, Ahlam   Journal of Research in Gender Studies  
5 Purnamasari, Agustina; Lee, Ahlam; Mo~ Journal of Research in Gender Studies  
6 Nishina, Adrienne                      Scientometrics                         
7 Lee, Ahlam                             Journal of Agricultural Education      
8 Lee, Ahlam; Cloutier, Aimee            Journal of Agricultural Education  
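
One caveat with a pattern built from raw names: each name is interpreted as a regular expression, so a character such as the "." in "Foor, Ryan M." matches any character. The sketch below is not part of the original answer; it uses base R only, and the near-miss name "Foor, Ryan Mx" is a hypothetical value invented to show the difference between regex matching and literal matching with fixed = TRUE.

```r
# Toy author strings; the second entry contains a hypothetical near-miss name.
pubs_author <- c("Moscatelli, Adriana; Nishina, Adrienne",
                 "Foor, Ryan Mx; Cano, Jamie",
                 "Lee, Ahlam; Cloutier, Aimee")

# Names to look for; the "." in the first one is a regex wildcard.
wanted <- c("Foor, Ryan M.", "Lee, Ahlam")

# Regex alternation: "." matches the "x", so the second string is a false hit.
regex_hit <- grepl(paste(wanted, collapse = "|"), pubs_author)

# Literal matching: test each name with fixed = TRUE, then OR the results.
literal_hit <- Reduce(`|`, lapply(wanted, grepl, x = pubs_author, fixed = TRUE))

print(regex_hit)
print(literal_hit)
```

Literal matching still finds a name anywhere inside a longer string, so splitting on the delimiter remains the safer choice when exact name matches are required.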

We can use a tidyverse approach. Separate the 'author' column at the ; delimiter into 'long' format, then do an inner_join. Afterwards, group by the row number column created earlier and paste the 'author' elements back into a single string.

library(tidyverse)
df1 %>%
  rownames_to_column('rn') %>% 
  separate_rows(author, sep=";\\s*") %>%
  inner_join(df2)%>% 
  group_by(rn, pubtitle) %>% 
  summarise(author = str_c(author, collapse = "; ")) %>%
  ungroup %>%
  select(names(df1))
# A tibble: 8 x 2
#  author                                                 pubtitle                                               
#  <chr>                                                  <chr>                                                  
#1 Moscatelli, Adriana; Nishina, Adrienne                 AIP Conference Proceedings                             
#2 Asangba, Abigail                                       Journal of Case Studies in Accreditation and Assessment
#3 Stewart, Abigail                                       173rd Meeting of Acoustical Society of America         
#4 Redmond-Sanogo, Adrienne; Lee, Ahlam                   Journal of Research in Gender Studies                  
#5 Purnamasari, Agustina; Lee, Ahlam; Moscatelli, Adriana Journal of Research in Gender Studies                  
#6 Nishina, Adrienne                                      Scientometrics                                         
#7 Lee, Ahlam                                             Journal of Agricultural Education                      
#8 Lee, Ahlam; Cloutier, Aimee                            Journal of Agricultural Education         

Or with str_detect and filter:

df1 %>% 
    filter(str_detect(author, str_c(df2$author, collapse="|")))
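
Note that the separate_rows / inner_join pipeline rebuilds each author string from the matching names only, whereas the str_detect filter keeps the original string untouched. In the sample data every co-author in a kept row happens to be in the name list, so the two outputs look the same. A minimal base-R sketch of the difference, assuming a hypothetical row that pairs "Lee, Ahlam" with a co-author who is not in the name list:

```r
# Hypothetical single-row example: only the first author is in the name list.
df1 <- data.frame(author = "Lee, Ahlam; Cano, Jamie",
                  pubtitle = "Journal of Agricultural Education",
                  stringsAsFactors = FALSE)
df2 <- data.frame(author = "Lee, Ahlam", stringsAsFactors = FALSE)

# Split on ";" plus optional whitespace, keep matching names, paste back.
parts <- strsplit(df1$author, ";\\s*")
rebuilt <- vapply(parts,
                  function(p) paste(p[p %in% df2$author], collapse = "; "),
                  character(1))

# Row filter that leaves the original string as it was.
kept <- df1$author[vapply(parts, function(p) any(p %in% df2$author), logical(1))]

print(rebuilt)
print(kept)
```

Which behaviour is wanted depends on whether the result should list all authors of a matching article or only the matched ones.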

If you're willing to use the tidyr package, there are some cool tools for separating out delimited lists, specifically separate and separate_rows.

data
# # A tibble: 10 x 2
#   author                                        pubtitle                                      
#   <chr>                                         <chr>                                         
# 1 Moscatelli, Adriana; Nishina, Adrienne        AIP Conference Proceedings                    
# 2 Asangba, Abigail                              Journal of Case Studies in Accreditation and ~
# 3 Stewart, Abigail                              173rd Meeting of Acoustical Society of America
# 4 Redmond-Sanogo, Adrienne; Lee, Ahlam          Journal of Research in Gender Studies         
# 5 Purnamasari, Agustina; Lee, Ahlam; Moscatell~ Journal of Research in Gender Studies         
# 6 Nishina, Adrienne                             Scientometrics                                
# 7 Lee, Ahlam                                    Journal of Agricultural Education             
# 8 Lee, Ahlam; Cloutier, Aimee                   Journal of Agricultural Education             
# 9 Kleihauer, Jay; Stephens, Roy; Hart, William  Journal of Agricultural Education             
# 10 Foor, Ryan M.; Cano, Jamie                    Journal of Agricultural Education        

female_authors_all
# # A tibble: 10 x 1
#                      author
# 1          Asangba, Abigail
# 2          Stewart, Abigail
# 3       Moscatelli, Adriana
# 4         Nishina, Adrienne
# 5  Redmond-Sanogo, Adrienne
# 6     Purnamasari, Agustina
# 7                Lee, Ahlam
# 8             Aliyeva, Aida
# 9           Belanger, Aimee
# 10          Cloutier, Aimee

data2 <- data %>%
  # If you want to keep the original names, duplicate the column first
  mutate(author_sep = author) %>%
  # Take each delimited author and give them their own row (tidy data).
  # Split on ";" plus any following whitespace so that names after the
  # first one carry no leading space and can match exactly.
  tidyr::separate_rows(author_sep, sep = ";\\s*") %>%
  # Filter to only keep rows where the individual author is in the other vector
  filter(author_sep %in% female_authors_all$author) %>%
  # Remove that extra column we created
  select(-author_sep) %>%
  # Remove duplicate rows in case more than one author in the delimited list was female
  distinct()

data2
# # A tibble: 8 x 2
#   author                                         pubtitle                                      
#   <chr>                                          <chr>                                         
# 1 Moscatelli, Adriana; Nishina, Adrienne         AIP Conference Proceedings                    
# 2 Asangba, Abigail                               Journal of Case Studies in Accreditation and ~
# 3 Stewart, Abigail                               173rd Meeting of Acoustical Society of America
# 4 Redmond-Sanogo, Adrienne; Lee, Ahlam           Journal of Research in Gender Studies         
# 5 Purnamasari, Agustina; Lee, Ahlam; Moscatelli~ Journal of Research in Gender Studies         
# 6 Nishina, Adrienne                              Scientometrics                                
# 7 Lee, Ahlam                                     Journal of Agricultural Education             
# 8 Lee, Ahlam; Cloutier, Aimee                    Journal of Agricultural Education   

Or using inner_join, which is more efficient than %in%:

data3 <- data %>%
  # If you want to keep the original names, duplicate the column first
  mutate(author_sep = author) %>%
  # Take each delimited author and give them their own row (tidy data).
  # Split on ";" plus any following whitespace so that names after the
  # first one carry no leading space and can match exactly.
  tidyr::separate_rows(author_sep, sep = ";\\s*") %>%
  # inner_join to keep only females
  inner_join(female_authors_all, by = c("author_sep" = "author")) %>%
  # Remove that extra column we created
  select(-author_sep) %>%
  # Remove duplicate rows in case more than one author in the delimited list was female
  distinct()
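
The same row filter can also be sketched in base R without reshaping: split each author string, trim whitespace, and keep the row if any piece is in the name vector. This is an illustrative sketch, not the answerer's code; the toy data reuses names from the post, and the variable names (female, pieces, keep_raw, keep_trimmed) are made up for the example. It also shows why the whitespace after ";" matters for exact matching.

```r
# Toy data built from names in the post.
data <- data.frame(
  author = c("Moscatelli, Adriana; Nishina, Adrienne",
             "Kleihauer, Jay; Stephens, Roy; Hart, William",
             "Lee, Ahlam; Cloutier, Aimee"),
  pubtitle = c("AIP Conference Proceedings",
               "Journal of Agricultural Education",
               "Journal of Agricultural Education"),
  stringsAsFactors = FALSE
)

# Both names appear only as second authors, i.e. after "; ".
female <- c("Nishina, Adrienne", "Cloutier, Aimee")

# Split on the bare ";" delimiter; later pieces keep a leading space.
pieces <- strsplit(data$author, ";", fixed = TRUE)

# Without trimming, " Nishina, Adrienne" is not equal to "Nishina, Adrienne".
keep_raw <- vapply(pieces, function(p) any(p %in% female), logical(1))

# With trimming, the second authors match as intended.
keep_trimmed <- vapply(pieces, function(p) any(trimws(p) %in% female), logical(1))

result <- data[keep_trimmed, ]
print(keep_raw)
print(result)
```

Because each original row is tested once, no duplicate rows are created and no distinct() step is needed.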
