How to find a subset of names in another column?

Question

I have a list of file names that look like this:

files$name <-c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf", "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")

I also have a much longer list of IDs, stored as a column of a larger data frame. Some of these IDs correspond to the file names in files$name, but they use different punctuation. The column looks like this:

df$repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62", "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99","RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")

I want to subset df$repec_id so that I keep only the strings that correspond to a file name in files$name, even though the punctuation differs. In other words, I want an output that looks like this:

ID_subset <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62")

Initially, I thought that removing all the special characters from both lists and then comparing them would work. So I did this:

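library(stringr)
# Strip the extension and punctuation from the file names
files$name <- str_replace_all(files$name, "\\.pdf", "")
files$name <- str_replace_all(files$name, "[[:punct:]]", "")
# Strip punctuation from the IDs (note: this overwrites the original IDs)
df$repec_id <- str_replace_all(df$repec_id, "[[:punct:]]", "")
subset <- df[trimws(df$repec_id) %in% trimws(files$name), ]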

However, I need a way of preserving the original structure of the IDs in df$repec_id, because I need to provide a list of the IDs from df$repec_id that are and are not in the subset. Does anyone have any suggestions? Thanks in advance for your help!

Answer 1

You can remove all punctuation from repec_id and name on the fly and use %in% to find the strings that match.

gsub('[[:punct:]]', '', df$repec_id) %in% 
          gsub('\\.pdf$|[[:punct:]]', '',files$name) 
#[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

If you add the negation operator ( ! ) in front of this expression, you get TRUE for the strings that do not match.

!gsub('[[:punct:]]', '', df$repec_id) %in% 
       gsub('\\.pdf$|[[:punct:]]', '',files$name) 
#[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

The result has the same length as df$repec_id, so you can use it to subset rows of df while leaving the original IDs untouched.
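As a minimal sketch of that last step, the logical vector can be stored and used to index the untouched ID column in both directions. The data is copied from the question; the names in_files and ID_unmatched are introduced here for illustration only.

```r
# Sample data copied from the question
files <- data.frame(
  name = c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
           "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  repec_id = c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
               "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"),
  stringsAsFactors = FALSE
)

# Compare punctuation-free copies; gsub() returns new vectors,
# so df$repec_id and files$name themselves are never modified
in_files <- gsub("[[:punct:]]", "", df$repec_id) %in%
  gsub("\\.pdf$|[[:punct:]]", "", files$name)

ID_subset    <- df$repec_id[in_files]   # IDs that have a matching file
ID_unmatched <- df$repec_id[!in_files]  # IDs that have no matching file
```

Because only temporary copies are stripped, both ID_subset and ID_unmatched keep the original colon-separated form.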

Answer 2

We can use a negated alphanumeric class, which removes every character that is not a letter or digit:

!gsub('[^[:alnum:]]+', '', df$repec_id) %in% gsub('\\.pdf$|[^[:alnum:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE  TRUE  TRUE
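If it is also useful to know which file each ID corresponds to, the same normalization could be wrapped in a small helper and combined with base R's match(). This is only a sketch: normalize_id and the file_name column are hypothetical names, not part of either answer.

```r
# Sample data copied from the question
files <- data.frame(
  name = c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
           "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  repec_id = c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
               "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing ".pdf", then keep only letters and digits
normalize_id <- function(x) {
  gsub("[^[:alnum:]]+", "", sub("\\.pdf$", "", x))
}

# match() returns, for each ID, the position of its file name (NA if none)
idx <- match(normalize_id(df$repec_id), normalize_id(files$name))

# New column with the matching file name; repec_id is left unchanged
df$file_name <- files$name[idx]

matched   <- df[!is.na(df$file_name), ]  # rows with a corresponding file
unmatched <- df[is.na(df$file_name), ]   # rows without one
```

One caveat of any punctuation-stripping comparison is that distinct IDs can collapse to the same key (for example, page ranges 17-20 and 1-720 would both become 1720), so it may be worth checking the normalized keys for duplicates on real data.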

Answer 1
1 2020-05-16 05:38:10

Answer 2
1 accepted 2020-05-16 20:17:45
