
In R, how to check if a word in one entry partially matches the word in another entry

Specifically, I'd like to check whether one of the words in an entry in one column is an exact substring of the entry in another column, with the condition that the leftover (non-matching) part of that entry cannot be too long (more than four characters).

If I have a data frame

df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"))

then I want the results to be

True, True, False

The first is true because one of the two words, "Denzel", is a substring of the other entry (and the leftover string "boss" is no longer than 4 characters). The second is true because one of the three words, "Garfield", is contained in the other entry; it's an exact match. The third is false because none of the three words is a substring of the entry in the 'check' column ("Gosling" would return true).

All entries in the second column have only one word. I don't want to use a fuzzy matching algorithm, because the word in the entry (like "Denzel") should be an exact substring of the other entry ("Denzelboss"). I also don't want to return true when the entry is "DenzelJohnson", where the leftover "Johnson" is too long.

Here I am running grepl in an mapply loop over the rows, and checking that the difference in length between the check entry and each matching word (number of characters, via nchar) does not exceed the limit of 4:

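# make sure both columns are character, not factor
df[] <- lapply(df, as.character)
mapply(
  # sp: the words of one name; ck: the matching check entry
  # TRUE if any word occurs in ck and ck is at most 4 characters longer than it
  function(sp,ck) any(sapply(sp, function(x) grepl(x,ck) & (nchar(ck)-nchar(x) <= 4))),
  strsplit(df$name,"\\s+"),
  df$check
)
#[1]  TRUE  TRUE FALSE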
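The same idea can be wrapped in a small function so that the "DenzelJohnson" case from the question is easy to check. This is only a sketch: the name partial_word_match and the max_extra argument are made up for illustration, and fixed = TRUE is an addition here so that each word is matched literally rather than as a regular expression.

```r
# Hypothetical helper (not part of the original answer): TRUE when any word of
# `name` is a literal substring of `check` and `check` is at most `max_extra`
# characters longer than that word.
partial_word_match <- function(name, check, max_extra = 4) {
  name  <- as.character(name)   # guard against factor columns
  check <- as.character(check)
  words <- strsplit(name, "\\s+")
  mapply(
    function(ws, ck) {
      ws <- ws[nzchar(ws)]      # drop empty tokens caused by leading whitespace
      any(vapply(ws, function(w) {
        # fixed = TRUE: treat the word as plain text, not as a regex
        grepl(w, ck, fixed = TRUE) && (nchar(ck) - nchar(w) <= max_extra)
      }, logical(1)))
    },
    words, check,
    USE.NAMES = FALSE
  )
}

df2 <- data.frame(
  name  = c("Denzel Washington", "Andrew Garfield Junior",
            "Ryan G Gosling", "Denzel Washington"),
  check = c("Denzelboss", "Garfield", "Goslin", "DenzelJohnson"),
  stringsAsFactors = FALSE
)
res <- partial_word_match(df2$name, df2$check)
print(res)
```

The fourth row is the "DenzelJohnson" example: "Denzel" is found, but the seven leftover characters exceed the limit, so that row should come out FALSE.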

Create your data frame with stringsAsFactors=F:

df <- data.frame("name"=c("Denzel Washington","Andrew Garfield Junior","Ryan G Gosling"),"check"=c("Denzelboss","Garfield","Goslin"),stringsAsFactors=F)

I use iterators::iter to iterate over the rows of df, and stringr verbs for the matching:

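library(iterators)
library(stringr)

Reduce("c", lapply(iter(df,by="row"), function(x) {
  # split the name into words, one list element per word
  words <- as.list(unlist(str_extract_all(x$name, boundary("word"))))
  # TRUE if any word occurs in check with fewer than 5 characters left over
  Reduce("any", mapply(function(y,z) ifelse(str_detect(z, y) & nchar(str_replace(z, y, "")) < 5, TRUE, FALSE), words, x$check))
}))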

[1]  TRUE  TRUE FALSE
