
How to find a subset of names in another column?

I have a list of file names that look like this:

files$name <- c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf", "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")

I also have a much longer list of IDs, stored as a column of a larger data frame. Some of these IDs correspond to entries in the list of file names (files$name), but they use different punctuation. The column looks like this:

df$repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62", "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99","RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")

I want to subset df$repec_id so that I keep only the strings that correspond to file names in files$name, even though the punctuation differs. In other words, I want output that looks like this:

ID_subset <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62")

Initially, I thought that removing all the special characters from both lists and then comparing them would work. So I did this:

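library(stringr)
# Strip the .pdf extension and all punctuation from the file names
files$name <- str_replace_all(files$name, "\\.pdf", "")
files$name <- str_replace_all(files$name, "[[:punct:]]", "")
# Strip punctuation from the IDs (note: this overwrites the original IDs)
df$repec_id <- str_replace_all(df$repec_id, "[[:punct:]]", "")
subset <- df[trimws(df$repec_id) %in% trimws(files$name), ]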

However, I need a way of preserving the original structure of the IDs in df$repec_id, because I have to provide a list of the IDs from df$repec_id that are, and are not, in the subset. Does anyone have any suggestions? Thanks in advance for your help!

You can remove all punctuation from repec_id and name and use %in% to find the strings that match.

gsub('[[:punct:]]', '', df$repec_id) %in% 
          gsub('\\.pdf$|[[:punct:]]', '',files$name) 
#[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

If you put the negation operator (!) in front of this, you get the strings that do not match.

!gsub('[[:punct:]]', '', df$repec_id) %in% 
       gsub('\\.pdf$|[[:punct:]]', '',files$name) 
#[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

The result has the same length as df$repec_id, so you can use it to subset rows from df. Because the gsub() output is never assigned back, the original IDs stay untouched.
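A minimal sketch of that subsetting step, using the sample data from the question. The data frames files and df are rebuilt here with an assumed one-column structure so the snippet runs on its own, and the names key_id, key_file, in_files, ID_rest and matched_rows are illustrative only (ID_subset is the name used in the question).

```r
# Sample data from the question, rebuilt as data frames (assumed structure).
files <- data.frame(name = c(
  "RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
  "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
  "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
  "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf"
), stringsAsFactors = FALSE)

df <- data.frame(repec_id = c(
  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
  "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
  "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
  "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"
), stringsAsFactors = FALSE)

# Build punctuation-free keys in temporary vectors;
# df$repec_id and files$name themselves are not modified.
key_id   <- gsub("[[:punct:]]", "", df$repec_id)
key_file <- gsub("\\.pdf$|[[:punct:]]", "", files$name)

# One TRUE/FALSE per row of df
in_files <- key_id %in% key_file

ID_subset    <- df$repec_id[in_files]        # IDs that have a matching file
ID_rest      <- df$repec_id[!in_files]       # IDs without a matching file
matched_rows <- df[in_files, , drop = FALSE] # full rows, still a data frame
```

Both ID_subset and ID_rest keep the original punctuation, since only the temporary keys were stripped.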

Alternatively, we can remove every non-alphanumeric character instead of only punctuation:

!gsub('[^[:alnum:]]+', '', df$repec_id) %in% gsub('\\.pdf$|[^[:alnum:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE  TRUE  TRUE
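The two patterns differ only in what they remove: [[:punct:]] drops punctuation characters, while [^[:alnum:]] drops everything that is not a letter or digit, whitespace included. A small sketch of the difference follows; the trailing space in the ID is a made-up case for illustration, not something from the question's data, and the variable names are hypothetical.

```r
# Hypothetical ID with a stray trailing space (not from the question's data)
id_with_space <- "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20 "
file_name     <- "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf"

file_key <- gsub("\\.pdf$|[^[:alnum:]]", "", file_name)

# Removing only punctuation leaves the trailing space in the key
punct_key <- gsub("[[:punct:]]", "", id_with_space)

# Removing all non-alphanumeric characters also removes the space
alnum_key <- gsub("[^[:alnum:]]+", "", id_with_space)

punct_key %in% file_key  # FALSE: the leftover space prevents the match
alnum_key %in% file_key  # TRUE
```

With the punctuation-only pattern, the same effect can be had by wrapping the keys in trimws(), as the question's own attempt does.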
