
Fuzzy, but not too fuzzy string matching with agrep

I have a character vector like this:

text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")

I would like to match a pattern so that it matches only once, with at most one substitution or insertion. The result should look like this:

> "Car"

I tried the following to match my pattern only once with at most one substitution, insertion, or deletion, and got this:

> agrep("ca?", text, ignore.case = TRUE, max.distance = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = TRUE)
[1] "Car"          "Ca-R"         "My Car"       "I drive cars" "CanCan"  

Is there a way to exclude the strings that are n characters longer than my pattern?

An alternative is to replace agrep with adist:

text[which(adist("ca?", text, ignore.case=TRUE) <= 1)]

adist gives the minimum number of insertions, deletions, and substitutions needed to convert one whole string into another (the generalized Levenshtein distance), so keeping only the elements whose distance is at most one should give you what you want, I think.
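To see why only one element survives, it can help to look at the distances themselves before filtering. This is a small sketch using the vector from the question; the names pattern, d and close_matches are just illustrative:

```r
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")
pattern <- "ca?"

# adist() returns a 1 x length(text) matrix; flatten it and label each
# distance with the string it belongs to so the values are easy to inspect.
d <- as.vector(adist(pattern, text, ignore.case = TRUE))
names(d) <- text
print(d)

# Keep only the strings within one edit of the whole pattern.
close_matches <- text[d <= 1]
print(close_matches)
```

Because adist compares whole strings rather than searching for an approximate substring, every extra character in a longer string counts as an edit, which is what filters out "My Car" and the others.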

This answer is probably less appropriate if you really want to exclude things "n-characters longer" than the pattern (with n being variable), rather than just match whole words (where n is always 1 in your example).

You can use nchar to limit the strings based on their length:

pattern <- "ca?"
matches <- agrep(pattern, text, ignore.case = TRUE, max.distance = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = TRUE)
n <- 4
matches[nchar(matches) < n+nchar(pattern)]
# [1] "Car"    "Ca-R"   "My Car" "CanCan"
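If you need this more than once, the two steps could be wrapped in a small function. This is only a sketch: agrep_short, extra and max_edits are made-up names, and it assumes you want the same limit for every edit type as in the question.

```r
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")

# Hypothetical helper: fuzzy-match with agrep(), then drop every match
# that is `extra` or more characters longer than the pattern.
agrep_short <- function(pattern, x, extra, max_edits = 1) {
  hits <- agrep(
    pattern, x,
    ignore.case = TRUE,
    value = TRUE,
    max.distance = list(
      all = max_edits,
      substitutions = max_edits,
      insertions = max_edits,
      deletions = max_edits
    )
  )
  # Same strict length test as above: keep strings shorter than
  # nchar(pattern) + extra.
  hits[nchar(hits) < nchar(pattern) + extra]
}

short_matches <- agrep_short("ca?", text, extra = 1)
print(short_matches)
```

With extra = 1 only strings no longer than the pattern are kept, so for this data you are left with just "Car"; a larger extra loosens the filter again.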
