
Is there an R package (or existing function) for fuzzy string detection?

I'm looking for something similar to str_detect() from the stringr package, but capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).

The matching I'm doing will take a form similar to the code below (this is just a simplified example I made up). In the example, only "RUTH CHRIS" gets matched; I'd like something capable of matching the slightly wrong strings as well.

library(tidyverse)

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
    )) 

So again, this gives this output:

# A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST NA       
# 2 NEW JERSEY WENDYS          NA       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA

But I want:

# A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP    
# 2 NEW JERSEY WENDYS          CHEAP    
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE

I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.

In base R, you could do:

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

transform(my_restaurants, category=pat[sapply(pat$values,agrep,restaurant),2])

                  restaurant  category
1 MCDOlNALD'S ON FRANKLIN ST     cheap
2          NEW JERSEY WENDYS     cheap
3         8/25/19 RUTH CHRIS expensive
4           MELTINGPO 9823i3 expensive
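
One caveat with the one-liner above: it indexes `pat` by the row positions that `agrep` returns, so it lines up here because each pattern happens to match the row in the same position. A minimal sketch of a variant that looks up the category per restaurant instead is below. The names `hits`, `first_hit` and `category` are mine, and `max.distance = 0.1` is simply the `agrep` default written out.

```r
# Restaurant strings from the question
restaurant <- c("MCDOlNALD'S ON FRANKLIN ST",
                "NEW JERSEY WENDYS",
                "8/25/19 RUTH CHRIS",
                "MELTINGPO 9823i3")

# Pattern -> category lookup, same shape as `pat` above
pat <- stack(list(cheap = c("MCDONALD'S", "WENDY'S"),
                  expensive = c("RUTH CHRIS", "MELTING POT")))

# hits[i, j] is TRUE when pattern j approximately matches restaurant i
hits <- sapply(pat$values,
               function(p) agrepl(p, restaurant, max.distance = 0.1))

# Index of the first matching pattern for each restaurant (NA if none match)
first_hit <- apply(hits, 1, function(row) which(row)[1])

# Translate pattern indices into category labels
category <- as.character(pat$ind[first_hit])

data.frame(restaurant, category)
```

On the four example strings this should give the same categories as the output above, and it does not depend on the order of the rows or the patterns. Note that `sapply` only returns a matrix when `restaurant` has more than one element.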

You can use fuzzyjoin::stringdist_left_join:

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

fuzzyjoin::stringdist_left_join(my_restaurants, pat, 
      c(restaurant='values'), max_dist=0.45, method = 'jaccard')

# A tibble: 4 x 3
  restaurant                 values      ind      
  <chr>                      <chr>       <fct>    
1 MCDOlNALD'S ON FRANKLIN ST MCDONALD'S  cheap    
2 NEW JERSEY WENDYS          WENDY'S     cheap    
3 8/25/19 RUTH CHRIS         RUTH CHRIS  expensive
4 MELTINGPO 9823i3           MELTING POT expensive
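
To see where a threshold like `max_dist = 0.45` comes from, it can help to compute the distance by hand. The sketch below assumes that the `jaccard` method, with the default q-gram size of 1, compares the sets of distinct single characters in the two strings; `jaccard_chars` is a hypothetical helper written for illustration, not a function from fuzzyjoin or stringdist.

```r
# Hypothetical helper: Jaccard distance between the sets of distinct
# characters in two strings (assumed to mirror method = 'jaccard' with q = 1)
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])
  sb <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

# 9 shared characters out of 16 distinct ones: 1 - 9/16 = 0.4375
jaccard_chars("MCDOlNALD'S ON FRANKLIN ST", "MCDONALD'S")

# 10 shared characters out of 15 distinct ones: 1 - 10/15
jaccard_chars("MELTINGPO 9823i3", "MELTING POT")
```

Under that assumption the first pair sits just under the 0.45 cutoff, so the threshold is fairly tightly tuned to this example. This measure also ignores character order, so on real data it is worth checking for false positives between unrelated names that share many letters.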

The top response to this question clued me in to try agrepl(), which seems to best suit my needs for this project since it is a straightforward substitute for str_detect().

Using my example from above...

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

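my_restaurants %>%
  mutate(category = case_when(
    # max.distance = 2 allows up to 2 edits; fixed = FALSE treats "|" as regex alternation
    agrepl(cheap, restaurant, max.distance = 2, fixed = FALSE) ~ "CHEAP",
    agrepl(expensive, restaurant, max.distance = 2, fixed = FALSE) ~ "EXPENSIVE"
  ))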

This gives the output:

# A tibble: 4 × 2
  restaurant                 category 
  <chr>                      <chr>    
1 MCDOlNALD'S ON FRANKLIN ST CHEAP    
2 NEW JERSEY WENDYS          CHEAP    
3 8/25/19 RUTH CHRIS         EXPENSIVE
4 MELTINGPO 9823i3           EXPENSIVE

However, onyambu's solutions also seem to be good alternative methods. They allow for more advanced forms of fuzzy matching than agrepl() is capable of.
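
For picking the degree of imperfection, one option is to look at the edit distances directly before settling on a `max.distance`. This is a rough sketch using base R's `adist()` with `partial = TRUE`, which, as I understand the documentation, measures the distance from each pattern to its best-matching substring (the same notion of distance `agrep` uses by default). The variable names are mine.

```r
patterns <- c("MCDONALD'S", "WENDY'S", "RUTH CHRIS", "MELTING POT")

restaurant <- c("MCDOlNALD'S ON FRANKLIN ST",
                "NEW JERSEY WENDYS",
                "8/25/19 RUTH CHRIS",
                "MELTINGPO 9823i3")

# d[i, j]: fewest insertions, deletions and substitutions needed to turn
# patterns[i] into some substring of restaurant[j]
d <- adist(patterns, restaurant, partial = TRUE)
dimnames(d) <- list(patterns, restaurant)

# Distance from each pattern to the restaurant it is meant to match
intended <- diag(d)

# Smallest max.distance that would catch every intended match
max(intended)
```

The off-diagonal entries of `d` are worth a look too: if an unrelated pattern comes within the chosen distance of a restaurant, that threshold will produce a false match.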
