How to fuzzy match two character vectors in R
I have a data frame df, where id identifies a person and fruits_eat lists the fruits that person eats. I also have a vector fruits_list storing a list of fruits.
I want to generate a new variable fruits_in_list that indicates whether a person ate one or more of the fruits in fruits_list, but I don't know how to implement it in R. I checked some existing answers, but none of them were very relevant to my problem.
fruits_Jack = c('XXappleYYY,lemon,orange,pitaya')
fruits_Rose = c('Navel orange,Blood orange,watermelon,cherry')
fruits_Biden = c('pitaya,cherry,banana')
fruits_list = c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')
df = data.frame(id = c('Jack', 'Rose', 'Biden'),
                fruits_eat = c(fruits_Jack, fruits_Rose, fruits_Biden))
> df
     id                                  fruits_eat
1  Jack              XXappleYYY,lemon,orange,pitaya
2  Rose Navel orange,Blood orange,watermelon,cherry
3 Biden                        pitaya,cherry,banana
df_expect = cbind(df, fruits_in_list = c(1, 1, 0))
> df_expect
     id                                  fruits_eat fruits_in_list
1  Jack              XXappleYYY,lemon,orange,pitaya              1
2  Rose Navel orange,Blood orange,watermelon,cherry              1
3 Biden                        pitaya,cherry,banana              0
With stringr, use str_detect, or str_count if you want an actual count:
library(stringr)
library(dplyr)
df %>%
  # unary + converts the TRUE/FALSE from str_detect() to 1/0
  mutate(fruits_in_list = +(str_detect(fruits_eat, paste0(fruits_list, collapse = "|"))),
         count = str_count(fruits_eat, paste0(fruits_list, collapse = "|")))
     id                                  fruits_eat fruits_in_list count
1  Jack              XXappleYYY,lemon,orange,pitaya              1     3
2  Rose Navel orange,Blood orange,watermelon,cherry              1     3
3 Biden                        pitaya,cherry,banana              0     0
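To see where the counts above come from, the same pattern can be applied with base R only. This is a minimal sketch rather than part of the original answer: it swaps str_detect/str_count for regmatches() and gregexpr(), and the variable names pattern, matches and counts are chosen here for illustration.

```r
# Same data as in the question, kept as plain vectors for brevity
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")
fruits_eat <- c("XXappleYYY,lemon,orange,pitaya",
                "Navel orange,Blood orange,watermelon,cherry",
                "pitaya,cherry,banana")

# One regular expression that matches any fruit in the list
pattern <- paste0(fruits_list, collapse = "|")

# Extract every match per string; strings with no match give character(0)
matches <- regmatches(fruits_eat, gregexpr(pattern, fruits_eat))

# Number of matches per string, and a 1/0 flag for "at least one match"
counts <- lengths(matches)
fruits_in_list <- as.integer(counts > 0)

print(matches)
print(counts)
print(fruits_in_list)
```

Note that the count is a count of matches, not of distinct fruits: Rose gets 3 because "orange" matches twice (in "Navel orange" and "Blood orange") and "watermelon" once.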
A solution using data.table and its fast if-else, fifelse(), with the base R function grepl() doing the matching. The "l" at the end of grepl() stands for logical: it returns TRUE if the pattern matches anywhere in the given string (fruits_eat) and FALSE otherwise, so its result can be passed directly to the test argument of fifelse().
The key point here is that you can paste the strings "string1" and "string2" together separated by "|", and the resulting pattern "string1|string2" matches either "string1" or "string2" inside grepl().
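library(data.table)
setDT(df)  # convert df to a data.table by reference
# 1 if any fruit in fruits_list occurs in fruits_eat, 0 otherwise
df[, fruits_in_list := fifelse(grepl(paste0(fruits_list, collapse = "|"), fruits_eat), 1, 0)]
df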
       id                                  fruits_eat fruits_in_list
1:   Jack              XXappleYYY,lemon,orange,pitaya              1
2:   Rose Navel orange,Blood orange,watermelon,cherry              1
3:  Biden                        pitaya,cherry,banana              0
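The "|" alternation that this answer relies on can also be checked in isolation, without data.table. The following is a small sketch using only base R; the test strings and the names pattern and hits are made up here for illustration.

```r
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")

# Collapse the vector into a single alternation pattern
pattern <- paste0(fruits_list, collapse = "|")
print(pattern)  # "apple|lemon|orange|watermelon|peach|pear"

# grepl() is vectorised over the strings and matches anywhere inside each one,
# so "XXappleYYY" counts as a hit for "apple"
hits <- grepl(pattern, c("XXappleYYY", "Blood orange", "banana"))
print(hits)  # TRUE TRUE FALSE

# The logical vector converts directly to 1/0 if a numeric flag is wanted
print(as.integer(hits))  # 1 1 0
```

Because the pattern is treated as a regular expression, this works as written for plain words like the ones in fruits_list; entries containing regex metacharacters (for example "." or "(") would presumably need escaping first.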