Understanding constraints in agrep fuzzy matching in R

This seems really simple, but I don't understand how agrep fuzzy matching behaves when substitutions are involved. Two substitutions produce a match, as expected, when all = 2 is specified, but not when substitutions = 2 is specified. Why is this?

# Finds a match, as expected
agrep("abcdeX", "abcdef", value = TRUE,
      max.distance = list(sub = 1, ins = 0, del = 0))
#> [1] "abcdef"

# Finds no match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub = 1, ins = 0, del = 0))
#> character(0)

# Finds a match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(all = 2))
#> [1] "abcdef"

# UNEXPECTEDLY finds no match
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub = 2, ins = 0, del = 0))
#> character(0)

Created on 2021-06-03 by the reprex package (v2.0.0)

all is an upper limit that always applies, regardless of the other max.distance controls (other than cost). It defaults to 10%, so unless you raise it, it can override a larger sub, ins or del limit.

# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
     max.distance = list(sub = 2, ins = 0, del = 0, all = 0.1))
# character(0)

# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
     max.distance = list(sub = 2, ins = 0, del = 0, all = 0.2))
# [1] "abcdef"

# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
    max.distance = list(sub = 1, ins = 1, del = 0, all = 0.1))
# character(0)

# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
    max.distance = list(sub = 1, ins = 1, del = 0, all = 0.2))
# [1] "abcdef"
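Because all also accepts a whole-number count, the call from the question can be made to behave as intended by raising all alongside sub. This is a minimal sketch that reuses the question's own strings and is equivalent to the all = 0.2 call above for a six-character pattern:

```r
# Raise the overall cap explicitly so that `sub = 2` is not overridden
# by the default `all` limit (10% of the pattern length, rounded up).
hits <- agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
              max.distance = list(sub = 2, ins = 0, del = 0, all = 2))
hits
# [1] "abcdef"
```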

There's a bit of a gotcha: all is read as a fraction of the pattern length only while it is below 1. At exactly 1 it switches to integer mode and means a single change.

# all allows up to 8 changes, so the ins = 2 limit is the one that applies
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
    max.distance = list(sub = 0, ins = 2, del = 0, all = 1 - 1e-9))
# [1] "abcdef"

# 1 insertion allowed
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
    max.distance = list(sub = 0, ins = 2, del = 0, all = 1))
# character(0)
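The switch between the two modes can be sketched with a small helper. effective_limit is a hypothetical function written for illustration, not part of base R; it mirrors the rule described in this answer and assumes the default unit costs:

```r
# Hypothetical helper (not part of base R): the integer limit implied by a
# max.distance component, assuming unit costs for all edit types.
effective_limit <- function(pattern, bound) {
  if (bound < 1) {
    # Fractional mode: a fraction of the pattern length, rounded up.
    as.integer(ceiling(bound * nchar(pattern)))
  } else {
    # Integer mode: the value is already a count of edits.
    as.integer(ceiling(bound))
  }
}

effective_limit("abcdXX", 0.1)          # 1
effective_limit("abcdXX", 0.2)          # 2
effective_limit("abcdXXef", 1 - 1e-9)   # 8
effective_limit("abcdXXef", 1)          # 1
```

The last two lines show the jump: just below 1 the limit is the full pattern length, while at exactly 1 it drops to a single edit.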

When you effectively suppress all by setting it to just less than 1, the individual limits (sub, ins, del) are the ones that apply.

# two substitutions allowed
agrep(pattern = "abcdXX", 
    x = c("abcdef", "abcXdef", "abcefg"), value = TRUE,
    max.distance = list(sub = 2, ins = 0, del = 0, all = 1 - 1e-9))
# [1] "abcdef"

The purpose of setting the costs is to let you move around the mutation space at different rates in different directions. How you set them depends on your use case. For example, some language dialects might be more likely to add letters, so you might choose to let a deletion cost as much as two insertions. By default, all three are equally weighted: costs = NULL is equivalent to costs = c(ins = 1, del = 1, sub = 1).
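To see the effect of unequal costs in isolation, one option is utils::adist(), which computes the generalized Levenshtein distance and takes the same kind of costs vector. The sketch below uses weights chosen purely for illustration (a deletion costs two insertions). It shows only the weighting; it makes no claim about how agrep combines costs with the other limits:

```r
# Illustrative weights only: a deletion costs as much as two insertions.
costs <- c(ins = 1, del = 2, sub = 1)

# adist(x, y) measures the cost of transforming x into y.
# "abc" -> "abcd" needs one insertion.
ins_cost <- utils::adist("abc", "abcd", costs = costs)[1, 1]
# "abcd" -> "abc" needs one deletion.
del_cost <- utils::adist("abcd", "abc", costs = costs)[1, 1]

ins_cost  # 1
del_cost  # 2
```

With these weights, a limit expressed through the cost component of max.distance is used up twice as fast by deletions as by insertions.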

EDIT: regarding your comment about why some patterns match and others don't, the 10% refers to the number of characters in the pattern, rounded up.

agrep(pattern = "01234567XX89", x = "0123456789", value = TRUE, 
    max.distance = list(sub = 0, ins = 2, del = 0))
# [1] "0123456789"
agrep(pattern = "01234567XX", x = "0123456789", value = TRUE, 
    max.distance = list(sub = 2, ins = 0, del = 0))
# character(0)
num_mutations <- nchar(c("01234567XX89", "01234567XX")) * 0.1
num_mutations
# [1] 1.2 1.0
ceiling(num_mutations)
# [1] 2 1

The first pattern is 12 characters long, so the default limit rounds up to two changes and both insertions are allowed. The second pattern is only 10 characters, so only one substitution is allowed.
