Subset a comma-delimited string from a list
This seems like a straightforward operation, but I seem to be stuck and am looking for pointers.
I have a data frame of authors and their associated publications. In the author column, there are often multiple authors for a single article, stored as a semicolon-delimited list. Here's a small subset:
structure(list(author = c("Moscatelli, Adriana; Nishina, Adrienne",
"Asangba, Abigail", "Stewart, Abigail", "Redmond-Sanogo, Adrienne; Lee, Ahlam",
"Purnamasari, Agustina; Lee, Ahlam; Moscatelli, Adriana",
"Nishina, Adrienne", "Lee, Ahlam",
"Lee, Ahlam; Cloutier, Aimee", "Kleihauer, Jay; Stephens, Roy; Hart, William",
"Foor, Ryan M.; Cano, Jamie"), pubtitle = c("AIP Conference Proceedings",
"Journal of Case Studies in Accreditation and Assessment", "173rd Meeting of Acoustical Society of America",
"Journal of Research in Gender Studies", "Journal of Research in Gender Studies",
"Scientometrics", "Journal of Agricultural Education", "Journal of Agricultural Education",
"Journal of Agricultural Education", "Journal of Agricultural Education"
)), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
I have a second data frame that has just the author names. Here's a subset of those names, for reproducibility:
structure(list(author = c("Asangba, Abigail", "Stewart, Abigail",
"Moscatelli, Adriana", "Nishina, Adrienne", "Redmond-Sanogo, Adrienne",
"Purnamasari, Agustina", "Lee, Ahlam", "Aliyeva, Aida", "Belanger, Aimee",
"Cloutier, Aimee")), row.names = c(NA, 10L), class = "data.frame")
I'm trying to use this second data frame to subset data from the original data frame, and I'm running into a challenge with the semicolon-delimited names.
I thought this would get me there, but no luck so far. I've tried to change the delimited string into a vector and then match against the list of authors, but it only returns names that appear individually (that is, I get no matches for names that appear inside a delimited string).
list_authors_female <- data %>%
select(author, pubtitle) %>%
filter(author %in% female_authors_all)
Here, I tried to separate the author column into a vector, but I'm hitting an error.
list_authors_female <- data %>%
rowwise() %>%
mutate(author_list = str_split(author, pattern = ";")) %>%
filter(author_list %in% female_authors_all)
Any pointers? Thanks!
Create a regular expression pat of the form author1|author2|...|authorN and apply it to pubs (here pubs is the publications data frame and authors is the data frame of names). With this approach no splitting is needed.
pat <- authors %>%
rowwise %>%
mutate(author = toString(author)) %>%
ungroup %>%
{ paste(.$author, collapse = "|") }
pubs %>% filter(grepl(pat, author))
giving:
# A tibble: 8 x 2
author pubtitle
<chr> <chr>
1 Moscatelli, Adriana; Nishina, Adrienne AIP Conference Proceedings
2 Asangba, Abigail Journal of Case Studies in Accreditati~
3 Stewart, Abigail 173rd Meeting of Acoustical Society of~
4 Redmond-Sanogo, Adrienne; Lee, Ahlam Journal of Research in Gender Studies
5 Purnamasari, Agustina; Lee, Ahlam; Mo~ Journal of Research in Gender Studies
6 Nishina, Adrienne Scientometrics
7 Lee, Ahlam Journal of Agricultural Education
8 Lee, Ahlam; Cloutier, Aimee Journal of Agricultural Education
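One caveat with a pasted-together pattern is that it matches a name anywhere inside the string, so a listed name would also match a longer name that merely contains it. If exact matches matter, a minimal base R sketch of an alternative is shown here. It assumes names are always separated by a semicolon plus optional whitespace; the helper has_listed_author and the toy data are hypothetical and not part of the answer above.

```r
# Toy data (hypothetical), standing in for `pubs` and `authors` above.
pubs <- data.frame(
  author = c("Moscatelli, Adriana; Nishina, Adrienne",
             "Kleihauer, Jay; Stephens, Roy",
             "Foor, Ryan M.; Lee, Ahlam"),
  pubtitle = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
authors <- data.frame(author = c("Nishina, Adrienne", "Lee, Ahlam"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when at least one name in the delimited string
# is exactly equal to one of the names in `wanted`.
has_listed_author <- function(x, wanted) {
  parts <- strsplit(x, ";\\s*")  # list with one character vector per row
  vapply(parts, function(p) any(trimws(p) %in% wanted), logical(1))
}

keep <- has_listed_author(pubs$author, authors$author)
pubs[keep, ]  # keeps the first and third rows of the toy data
```

The trade-off is that this does split the strings, but each row is kept or dropped as a whole, so the original author string is preserved.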
We can use a tidyverse approach. Separate the 'author' column at the ; delimiter into 'long' format, then do an inner_join, and afterwards, grouped by the row number column already created, paste the 'author' elements back into a single string.
library(tidyverse)
df1 %>%
rownames_to_column('rn') %>%
separate_rows(author, sep=";\\s*") %>%
inner_join(df2)%>%
group_by(rn, pubtitle) %>%
summarise(author = str_c(author, collapse = "; ")) %>%
ungroup %>%
select(names(df1))
# A tibble: 8 x 2
# author pubtitle
# <chr> <chr>
#1 Moscatelli, Adriana; Nishina, Adrienne AIP Conference Proceedings
#2 Asangba, Abigail Journal of Case Studies in Accreditation and Assessment
#3 Stewart, Abigail 173rd Meeting of Acoustical Society of America
#4 Redmond-Sanogo, Adrienne; Lee, Ahlam Journal of Research in Gender Studies
#5 Purnamasari, Agustina; Lee, Ahlam; Moscatelli, Adriana Journal of Research in Gender Studies
#6 Nishina, Adrienne Scientometrics
#7 Lee, Ahlam Journal of Agricultural Education
#8 Lee, Ahlam; Cloutier, Aimee Journal of Agricultural Education
Or with str_detect and filter:
df1 %>%
filter(str_detect(author, str_c(df2$author, collapse="|")))
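Because the collapsed names are interpreted as a regular expression, a character such as the . in "Foor, Ryan M." matches any character. The following base R sketch illustrates the effect and one way to sidestep it by testing each name literally with fixed = TRUE. The strings and the vector names_wanted are made up for illustration.

```r
# Hypothetical strings: only the second contains the literal name "Foor, Ryan M."
strings <- c("Foor, Ryan Mx; Cano, Jamie", "Foor, Ryan M.; Cano, Jamie")
names_wanted <- c("Foor, Ryan M.", "Aliyeva, Aida")

# Regex matching: "." matches any character, so both strings are hits.
regex_hit <- grepl(paste(names_wanted, collapse = "|"), strings)

# Literal matching: test every name with fixed = TRUE, then combine with any().
literal_hit <- vapply(
  strings,
  function(s) {
    any(vapply(names_wanted, function(n) grepl(n, s, fixed = TRUE), logical(1)))
  },
  logical(1),
  USE.NAMES = FALSE
)

regex_hit    # TRUE TRUE
literal_hit  # FALSE TRUE
```

For the sample data in the question this makes no difference to the result; it only matters when names contain regex metacharacters.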
If you're willing to use the tidyr package, there are some cool tools for separating out delimited lists, specifically separate and separate_rows.
data
# # A tibble: 10 x 2
# author pubtitle
# <chr> <chr>
# 1 Moscatelli, Adriana; Nishina, Adrienne AIP Conference Proceedings
# 2 Asangba, Abigail Journal of Case Studies in Accreditation and ~
# 3 Stewart, Abigail 173rd Meeting of Acoustical Society of America
# 4 Redmond-Sanogo, Adrienne; Lee, Ahlam Journal of Research in Gender Studies
# 5 Purnamasari, Agustina; Lee, Ahlam; Moscatell~ Journal of Research in Gender Studies
# 6 Nishina, Adrienne Scientometrics
# 7 Lee, Ahlam Journal of Agricultural Education
# 8 Lee, Ahlam; Cloutier, Aimee Journal of Agricultural Education
# 9 Kleihauer, Jay; Stephens, Roy; Hart, William Journal of Agricultural Education
# 10 Foor, Ryan M.; Cano, Jamie Journal of Agricultural Education
female_authors_all
# # A tibble: 10 x 1
# author
# 1 Asangba, Abigail
# 2 Stewart, Abigail
# 3 Moscatelli, Adriana
# 4 Nishina, Adrienne
# 5 Redmond-Sanogo, Adrienne
# 6 Purnamasari, Agustina
# 7 Lee, Ahlam
# 8 Aliyeva, Aida
# 9 Belanger, Aimee
# 10 Cloutier, Aimee
data2 <- data %>%
# If you want to keep the original names, duplicate the column first
mutate(author_sep = author) %>%
# Take each delimited author and give them their own row (tidy data);
# the separator also consumes the whitespace after each semicolon
tidyr::separate_rows(author_sep, sep = ";\\s*") %>%
# Filter to only keep rows where the individual author is in the other vector
filter(author_sep %in% female_authors_all$author) %>%
# Remove that extra column we created
select(-author_sep) %>%
# Remove duplicate rows in case more than one author in the delimited list was female
distinct()
data2
# # A tibble: 8 x 2
# author pubtitle
# <chr> <chr>
# 1 Moscatelli, Adriana; Nishina, Adrienne AIP Conference Proceedings
# 2 Asangba, Abigail Journal of Case Studies in Accreditation and ~
# 3 Stewart, Abigail 173rd Meeting of Acoustical Society of America
# 4 Redmond-Sanogo, Adrienne; Lee, Ahlam Journal of Research in Gender Studies
# 5 Purnamasari, Agustina; Lee, Ahlam; Moscatelli~ Journal of Research in Gender Studies
# 6 Nishina, Adrienne Scientometrics
# 7 Lee, Ahlam Journal of Agricultural Education
# 8 Lee, Ahlam; Cloutier, Aimee Journal of Agricultural Education
Or using inner_join, which can be more efficient than %in%:
data3 <- data %>%
# If you want to keep the original names, duplicate the column first
mutate(author_sep = author) %>%
# Take each delimited author and give them their own row (tidy data);
# the separator also consumes the whitespace after each semicolon
tidyr::separate_rows(author_sep, sep = ";\\s*") %>%
# inner_join to keep only females
inner_join(female_authors_all, by = c("author_sep" = "author")) %>%
# Remove that extra column we created
select(-author_sep) %>%
# Remove duplicate rows in case more than one author in the delimited list was female
distinct()
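A detail worth checking with either variant: the names are delimited by a semicolon followed by a space, so splitting on ";" alone leaves a leading space on every piece after the first, and %in% or a join will then miss those names. Here is a minimal base R sketch of the effect, using one string from the sample data. The variable names are chosen for illustration only.

```r
x <- "Moscatelli, Adriana; Nishina, Adrienne"
wanted <- c("Nishina, Adrienne")

# Split on the bare semicolon: the second piece keeps its leading space.
raw_parts <- strsplit(x, ";", fixed = TRUE)[[1]]
raw_match <- raw_parts %in% wanted      # FALSE FALSE

# Trim the pieces (or split on ";\\s*") and the second author is found.
clean_parts <- trimws(raw_parts)
clean_match <- clean_parts %in% wanted  # FALSE TRUE
```

On the sample data the first author of every multi-author row happens to be in the list, so the problem does not show up in the printed result, but it could with other data.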