Is there an R package (or existing function) for fuzzy string detection?
I'm looking for something similar to str_detect() from the stringr package, but capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).
The matching I'm doing will take a form similar to the code below (this is just a simplified example I made up). In the example, only "RUTH CHRIS" gets matched; I'd like something capable of matching the slightly wrong strings as well.
library(tidyverse)
my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3"))
cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse = "|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse = "|")
my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
  ))
This gives the following output:
## A tibble: 4 × 2
#   restaurant                 category
#   <chr>                      <chr>
# 1 MCDOlNALD'S ON FRANKLIN ST NA
# 2 NEW JERSEY WENDYS          NA
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA
But I want:
## A tibble: 4 × 2
#   restaurant                 category
#   <chr>                      <chr>
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP
# 2 NEW JERSEY WENDYS          CHEAP
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE
I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.
In base R, you could do:
cheap <- c("MCDONALD'S", "WENDY'S")
expensive <- c("RUTH CHRIS", "MELTING POT")
pat <- stack(list(cheap = cheap, expensive = expensive))
transform(my_restaurants, category = pat[sapply(pat$values, agrep, restaurant), 2])
                  restaurant  category
1 MCDOlNALD'S ON FRANKLIN ST     cheap
2          NEW JERSEY WENDYS     cheap
3         8/25/19 RUTH CHRIS expensive
4           MELTINGPO 9823i3 expensive
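The indexing in the one-liner above uses the row numbers returned by agrep() to subscript pat, which lines up here only because pattern k happens to match row k. It also relies on agrep()'s default max.distance of 0.1, a fraction of the pattern length that is rounded up to a whole number of edits. A more defensive sketch, assuming base R only, is below; categorize() and its max_edits argument are hypothetical names introduced for illustration, not part of any package.

```r
restaurants <- c("MCDOlNALD'S ON FRANKLIN ST", "NEW JERSEY WENDYS",
                 "8/25/19 RUTH CHRIS", "MELTINGPO 9823i3")
pat <- stack(list(cheap = c("MCDONALD'S", "WENDY'S"),
                  expensive = c("RUTH CHRIS", "MELTING POT")))

# Hypothetical helper: category of the first pattern that matches each
# string within `max_edits` edits, or NA when no pattern matches.
categorize <- function(x, pat, max_edits) {
  # One logical column per pattern; rows follow the order of x
  hits <- vapply(pat$values,
                 function(p) agrepl(p, x, max.distance = max_edits, fixed = TRUE),
                 logical(length(x)))
  hits <- matrix(hits, nrow = length(x))
  # Index of the first matching pattern per row (NA if none)
  first <- apply(hits, 1, function(r) which(r)[1])
  as.character(pat$ind[first])
}

one_edit  <- categorize(restaurants, pat, max_edits = 1)
two_edits <- categorize(restaurants, pat, max_edits = 2)
print(data.frame(restaurant = restaurants, one_edit, two_edits))
```

With this data, a limit of 1 edit should leave MELTINGPO 9823i3 unlabeled (it is missing both the space and the final T), while a limit of 2 should label all four rows.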
You can use fuzzyjoin::stringdist_left_join:
cheap <- c("MCDONALD'S", "WENDY'S")
expensive <- c("RUTH CHRIS", "MELTING POT")
pat <- stack(list(cheap = cheap, expensive = expensive))
fuzzyjoin::stringdist_left_join(my_restaurants, pat,
                                by = c(restaurant = 'values'),
                                max_dist = 0.45, method = 'jaccard')
# A tibble: 4 x 3
  restaurant                 values      ind
  <chr>                      <chr>       <fct>
1 MCDOlNALD'S ON FRANKLIN ST MCDONALD'S  cheap
2 NEW JERSEY WENDYS          WENDY'S     cheap
3 8/25/19 RUTH CHRIS         RUTH CHRIS  expensive
4 MELTINGPO 9823i3           MELTING POT expensive
The top response to this question clued me in to try agrepl(), which seems to best suit my needs for this project since it is a straightforward substitute for str_detect().
Using my example from above:
my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3"))
cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse = "|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse = "|")
my_restaurants %>%
  mutate(category = case_when(
    agrepl(cheap, restaurant, max.distance = 2, fixed = FALSE) ~ "CHEAP",
    agrepl(expensive, restaurant, max.distance = 2, fixed = FALSE) ~ "EXPENSIVE"
  ))
This gives the output:
# A tibble: 4 × 2
  restaurant                 category
  <chr>                      <chr>
1 MCDOlNALD'S ON FRANKLIN ST CHEAP
2 NEW JERSEY WENDYS          CHEAP
3 8/25/19 RUTH CHRIS         EXPENSIVE
4 MELTINGPO 9823i3           EXPENSIVE
However, onyambu's solutions also seem to be good alternative methods. They allow for more advanced forms of fuzzy matching than agrepl() is capable of.
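To choose a max.distance rather than guess one, it can help to look at the edit distances directly. The sketch below, assuming base R only, uses adist() with partial = TRUE, which measures the generalized Levenshtein distance between each pattern and its best-matching substring of each string. The variable names are illustrative.

```r
restaurants <- c("MCDOlNALD'S ON FRANKLIN ST", "NEW JERSEY WENDYS",
                 "8/25/19 RUTH CHRIS", "MELTINGPO 9823i3")
patterns <- c("MCDONALD'S", "WENDY'S", "RUTH CHRIS", "MELTING POT")

# d[i, j]: fewest insertions, deletions and substitutions needed to turn
# patterns[i] into some substring of restaurants[j]
d <- adist(patterns, restaurants, partial = TRUE)
dimnames(d) <- list(patterns, restaurants)
print(d)

# Closest pattern for each restaurant, and how far away it is
closest  <- patterns[apply(d, 2, which.min)]
distance <- apply(d, 2, min)
print(data.frame(restaurant = restaurants, closest, distance, row.names = NULL))
```

The largest distance among the rows you want matched suggests the smallest max.distance that would catch them all, and the gap to the next-closest pattern shows how much room there is before false matches appear.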