Unexpected agrep() results related to max.distance in R

Question

EDIT: This bug was found in 32-bit versions of R and was fixed in R version 2.9.2.

This was tweeted to me by @leoniedu today. I don't have an answer for him, so I thought I would post it here.

I have read the documentation for agrep() (fuzzy string matching), and it appears that I don't fully understand the max.distance parameter. Here's an example:

pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18) 
agrep(pattern,x,max.distance=19)

That behaves exactly as I would expect. The strings differ by 18 characters, so I would expect 18 to be the threshold for a match. Here's what's confusing me:

agrep(pattern,x,max.distance=30) 
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32) 
agrep(pattern,x,max.distance=33)

Why do 30 and 33 match, but not 31 and 32? To save you some counting,

> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
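The count of 18 can be cross-checked directly. This is a minimal sketch, assuming an R version that provides utils::adist(), which as far as I know was added after the R 2.9.x releases discussed here. The names len_diff and full_dist are illustrative only.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# x is a suffix of pattern, so deleting the leading characters of
# pattern is enough to turn it into x.
len_diff <- nchar(pattern) - nchar(x)

# Generalized Levenshtein distance between the two whole strings
# (default costs: 1 per insertion, deletion, or substitution).
full_dist <- drop(adist(pattern, x))

print(len_diff)   # 18
print(full_dist)  # 18
```

Both values agree, which is consistent with 18 being the smallest max.distance at which a match should be expected.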

Answer 1

I posted this on the R list a while back and reported it as a bug on the R-bugs list. I got no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.

Note that, at least in R, agrep is a misnomer, since it does not match regular expressions, while grep stands for "Globally search for the Regular Expression and Print". It shouldn't have a problem with patterns longer than the target vector (I think!).

On my Linux server all is well, but not on my Mac and Windows machines.

Mac:

sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1
locale: en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

> agrep(pattern,x,max.distance=30)
[1] 1
> agrep(pattern,x,max.distance=31)
integer(0)
> agrep(pattern,x,max.distance=32)
integer(0)
> agrep(pattern,x,max.distance=33)
[1] 1

Linux:

R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu
locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

> agrep(pattern,x,max.distance=30)
[1] 1
> agrep(pattern,x,max.distance=31)
[1] 1
> agrep(pattern,x,max.distance=32)
[1] 1
> agrep(pattern,x,max.distance=33)
[1] 1

Answer 2

I am not sure your example makes sense. For the basic grep(), pattern is usually a simple string or a regular expression, and x is a vector whose elements are matched against pattern. Having a pattern that is longer than x strikes me as odd.

Consider this, where we just use grep instead of substr:

R> grep("vo", c("foo","bar","baz"))   # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>
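The thresholds in this example can be read as edit counts. This is a minimal sketch, assuming the rule in the current agrep() documentation that a fractional max.distance is multiplied by the pattern length and rounded up to an integer; R 2.9.x may have behaved differently. It also assumes that adist(partial = TRUE), available in later R versions, mirrors agrep()'s substring matching. The names pat, fractions and allowed_edits are illustrative only.

```r
pat <- "vo"
fractions <- c(0.1, 0.25, 0.75)   # 0.1 is the documented default

# Fraction of the pattern length, rounded up to a whole number of edits
allowed_edits <- ceiling(fractions * nchar(pat))
print(allowed_edits)   # 1 1 2

# Edit distance from the pattern to the closest substring of each candidate
print(adist(pat, c("foo", "bar", "baz"), partial = TRUE))
```

Under that reading, the default and 0.25 both allow a single edit, which is enough to turn "vo" into the "fo" in "foo", while 0.75 allows two edits, and a two-character pattern is within two edits of any string.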

Answer 1: score 2, accepted, 2009-07-25 22:32:32

Answer 2: score 0, 2009-07-25 21:56:08
