
How to fuzzy match two character vectors in R

Context

I have a data frame df, where id identifies a person and fruits_eat lists the fruits that person eats. I also have a vector fruits_list that stores a list of fruits.

Question

I want to generate a new variable fruits_in_list that indicates whether a person ate one or more of the fruits in fruits_list, but I don't know how to implement it in R.

What I've done

I checked some answers, but none of them are very relevant to my problem, for example:

  1. R Match character vectors
  2. Compare two character vectors in R
  3. https://stackoverflow.com/search?q=How+to+fuzzy+match+two+character+vectors
  4. How to run through list of keyword vectors and fuzzy match them to a different file (R)
  5. Matching strings with abbreviations; fuzzy matching

Reproducible code

fruits_Jack = c('XXappleYYY,lemon,orange,pitaya')
fruits_Rose = c('Navel orange,Blood orange,watermelon,cherry')
fruits_Biden= c('pitaya,cherry,banana')

fruits_list = c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')

df = 
  data.frame(id         = c('Jack', 'Rose', 'Biden'),
             fruits_eat = c(fruits_Jack, fruits_Rose, fruits_Biden))

> df
     id                                  fruits_eat
1  Jack              XXappleYYY,lemon,orange,pitaya
2  Rose Navel orange,Blood orange,watermelon,cherry
3 Biden                        pitaya,cherry,banana


Expected output

df_expect = cbind(df, fruits_in_list = c(1, 1, 0))

> df_expect
     id                                  fruits_eat fruits_in_list
1  Jack              XXappleYYY,lemon,orange,pitaya              1
2  Rose Navel orange,Blood orange,watermelon,cherry              1
3 Biden                        pitaya,cherry,banana              0

With stringr, use str_detect, or str_count if you want an actual count:

library(stringr)
library(dplyr)
df %>% 
  mutate(fruits_in_list = +(str_detect(fruits_eat, paste0(fruits_list, collapse = "|"))),
         count = str_count(fruits_eat, paste0(fruits_list, collapse = "|")))
     id                                  fruits_eat fruits_in_list count
1  Jack              XXappleYYY,lemon,orange,pitaya              1     3
2  Rose Navel orange,Blood orange,watermelon,cherry              1     3
3 Biden                        pitaya,cherry,banana              0     0
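For readers who prefer not to load stringr, the same flag and count can be sketched in base R. This is only an illustrative equivalent, not part of the answer above; the data are redefined as plain vectors so the snippet runs on its own.

```r
# Same data as in the question, redefined so the snippet is self-contained
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")
fruits_eat  <- c("XXappleYYY,lemon,orange,pitaya",
                 "Navel orange,Blood orange,watermelon,cherry",
                 "pitaya,cherry,banana")

# One regex with every fruit name as an alternative
pattern <- paste(fruits_list, collapse = "|")

# 1/0 flag: does any fruit name occur anywhere in the string?
fruits_in_list <- as.integer(grepl(pattern, fruits_eat))

# Number of matches per string; a string with no match gives character(0),
# so lengths() reports 0 for it
count <- lengths(regmatches(fruits_eat, gregexpr(pattern, fruits_eat)))

fruits_in_list  # 1 1 0
count           # 3 3 0
```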

This solution uses data.table and its fast if-else function fifelse(), together with the base R function grepl() for the matching. The "l" at the end of grepl() stands for logical: it returns TRUE if the pattern matches anywhere in the given string (fruits_eat) and FALSE otherwise, so its result can be passed directly as the test argument of fifelse().

The key point is that you can paste the strings "string1" and "string2" together separated by "|", and the resulting pattern "string1|string2" matches either "string1" or "string2" inside grepl().
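A minimal illustration of that pattern construction, using two made-up keywords and three made-up test strings (none of them taken from the question's data frame):

```r
# Hypothetical keywords, chosen only to show how the pattern is built
keywords <- c("apple", "lemon")

# Join the keywords with "|" so the regex means "apple OR lemon"
pattern <- paste(keywords, collapse = "|")
pattern
# [1] "apple|lemon"

# grepl() returns TRUE wherever either keyword occurs as a substring
matched <- grepl(pattern, c("XXappleYYY", "pitaya", "lemon tea"))
matched
# [1]  TRUE FALSE  TRUE
```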

library(data.table)
setDT(df)

df[, fruits_in_list := fifelse(grepl(paste0(fruits_list,
                                            collapse = "|"), fruits_eat),1,0)]
df
      id                                  fruits_eat fruits_in_list
1:  Jack              XXappleYYY,lemon,orange,pitaya              1
2:  Rose Navel orange,Blood orange,watermelon,cherry              1
3: Biden                        pitaya,cherry,banana              0
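One caveat with the "|" approach is that every keyword is interpreted as a regular expression, so a name containing characters such as "." or "(" could match more than intended. A possible workaround, sketched below under that assumption, is to test each keyword literally with fixed = TRUE and combine the results. The helper name match_any_fixed is hypothetical and not part of either answer.

```r
# Same data as in the question, redefined so the snippet is self-contained
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")
fruits_eat  <- c("XXappleYYY,lemon,orange,pitaya",
                 "Navel orange,Blood orange,watermelon,cherry",
                 "pitaya,cherry,banana")

# Hypothetical helper: literal (non-regex) substring test for each keyword,
# then an element-wise OR across keywords
match_any_fixed <- function(x, keywords) {
  hits <- lapply(keywords, function(k) grepl(k, x, fixed = TRUE))
  as.integer(Reduce(`|`, hits))
}

result <- match_any_fixed(fruits_eat, fruits_list)
result  # 1 1 0
```

For the fruit names in the question this gives the same flags as the regex version, since none of them contain regex metacharacters.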
