How to find a subset of names in another column?
I have a list of file names that look like this:
files$name <-c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf", "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")
I also have a much longer list of IDs, stored as a column of a larger data frame. Some of these IDs correspond to the file names in files$name, but they use different punctuation. The column looks like this:
df$repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62", "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99","RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")
I want to subset df$repec_id so that I keep only the strings that correspond to file names in files$name, even though the punctuation differs. In other words, I want an output that looks like this:
ID_subset <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62")
Initially, I thought that removing all the special characters from both lists and then comparing them would work. So I did this:
library(stringr)
files$name <- str_replace_all(files$name, "\\.pdf", "")
files$name <- str_replace_all(files$name, "[[:punct:]]", "")
df$repec_id <- str_replace_all(df$repec_id, "[[:punct:]]", "")
subset <- df[trimws(df$repec_id) %in% trimws(files$name), ]
However, this overwrites the IDs, and I need to preserve the original structure of the IDs in df$repec_id because I have to provide a list of the IDs from df$repec_id that are, and are not, in the subset. Does anyone have any suggestions? Thanks in advance for your help!
You can remove all punctuation from repec_id and name on the fly, without overwriting either column, and use %in% to find the strings that match.
gsub('[[:punct:]]', '', df$repec_id) %in%
gsub('\\.pdf$|[[:punct:]]', '',files$name)
#[1] TRUE TRUE TRUE TRUE FALSE FALSE
If you put the negation operator ( ! ) in front of this, you get the strings that do not match.
!gsub('[[:punct:]]', '', df$repec_id) %in%
gsub('\\.pdf$|[[:punct:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
The result has the same length as df$repec_id, so you can use it directly as a logical index to subset rows of df while the original IDs stay unchanged.
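The subsetting step described above can be sketched as follows. This is a minimal example that rebuilds the sample data from the question; the names id_key, file_key, in_files, ids_with_file, ids_without_file and df_matched are illustrative and not part of the original post.

```r
# Sample data from the question, rebuilt as small data frames
files <- data.frame(
  name = c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
           "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  repec_id = c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
               "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"),
  stringsAsFactors = FALSE
)

# Normalised copies used only for the comparison; the columns are not modified
id_key   <- gsub("[[:punct:]]", "", df$repec_id)
file_key <- gsub("\\.pdf$|[[:punct:]]", "", files$name)

# One TRUE/FALSE per row of df
in_files <- id_key %in% file_key

# Original, unmodified IDs split by whether a matching file exists
ids_with_file    <- df$repec_id[in_files]
ids_without_file <- df$repec_id[!in_files]

# Rows of df that have a matching file
df_matched <- df[in_files, , drop = FALSE]
```

Because the cleaned strings live in separate vectors, df$repec_id keeps its original punctuation, which is what the question asks for.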
Alternatively, we can strip every non-alphanumeric character instead of only punctuation:
!gsub('[^[:alnum:]]+', '', df$repec_id) %in% gsub('\\.pdf$|[^[:alnum:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
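If you also need to know which file each ID corresponds to, the same normalisation can be combined with match() instead of %in%. The sketch below is one possible way to do that, not something from the original answers; normalize_key, files_name, idx and lookup are hypothetical names introduced for illustration.

```r
# Hypothetical helper: drop a trailing ".pdf" and every non-alphanumeric character
normalize_key <- function(x) gsub("\\.pdf$|[^[:alnum:]]+", "", x)

files_name <- c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
                "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
                "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
                "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")
repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
              "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
              "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
              "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
              "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
              "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")

# Position of the matching file for each ID, NA where there is none
idx <- match(normalize_key(repec_id), normalize_key(files_name))

# Original ID next to its original file name (NA for IDs without a file)
lookup <- data.frame(repec_id = repec_id,
                     file = files_name[idx],
                     stringsAsFactors = FALSE)
```

IDs without a file can then be listed with lookup$repec_id[is.na(lookup$file)].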