Unexpected agrep() results related to max.distance in R

EDIT: This bug was found in 32-bit versions of R and was fixed in R version 2.9.2.


This was tweeted to me by @leoniedu today and I don't have an answer for him, so I thought I would post it here.

I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:

pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18) 
agrep(pattern,x,max.distance=19)

That behaves exactly as I would expect. The strings differ by 18 characters, so I would expect that to be the threshold for a match. Here's what's confusing me:

agrep(pattern,x,max.distance=30) 
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32) 
agrep(pattern,x,max.distance=33)

Why do 30 and 33 match, but not 31 and 32? To save you some counting:

> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
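The figure of 18 quoted above can be checked directly. The following is only a sketch and assumes a more recent R than the 2.9.x releases discussed here, since adist() (in utils) was added in R 2.14.0. It compares the length difference with the generalized Levenshtein distance between the two full strings.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Difference in length: 34 - 16
len_diff <- nchar(pattern) - nchar(x)

# Generalized Levenshtein distance between the two complete strings.
# adist() returns a 1 x 1 matrix here; drop() reduces it to a scalar.
full_dist <- drop(adist(pattern, x))

print(len_diff)   # 18
print(full_dist)  # 18
```

Because x is a suffix of pattern, deleting the leading "Staatssekretar im " (18 characters) is enough, which is why both numbers agree.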

I posted this on the R list a while back and reported it as a bug on the R-bugs list. I got no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.

Note that, at least in R, agrep is a misnomer, since it does not match regular expressions, while grep stands for "Globally search for the Regular Expression and Print". It shouldn't have a problem with patterns longer than the target vector (I think!).

On my Linux server all is well, but not on my Mac and Windows machines.

Mac:

sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1
locale: en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
integer(0)
agrep(pattern,x,max.distance=32)
integer(0)
agrep(pattern,x,max.distance=33)
[1] 1

Linux:

R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu
locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
[1] 1
agrep(pattern,x,max.distance=32)
[1] 1
agrep(pattern,x,max.distance=33)
[1] 1
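To repeat this comparison on another machine, something like the following sketch may help. It runs the same four calls and prints them next to the platform details. The variable names distances and matched are only illustrative, and no particular result is assumed, since the outcome is exactly what differed between the machines above.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# TRUE where agrep() reports a match for the given max.distance
distances <- 30:33
matched <- vapply(
  distances,
  function(d) length(agrep(pattern, x, max.distance = d)) > 0,
  logical(1)
)

# Platform details, so results from different machines can be compared
cat(R.version.string, "\n")
cat("platform:", R.version$platform, "\n")
cat("pointer size (bytes):", .Machine$sizeof.pointer, "\n")
print(data.frame(max.distance = distances, matched = matched))
```

A pointer size of 4 bytes indicates a 32-bit build and 8 bytes a 64-bit build.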

I am not sure your example makes sense. For the basic grep(), pattern is often a simple string or a regular expression, and x is a vector whose elements get matched against pattern. Having pattern be a longer string than x strikes me as odd.

Consider this where we just use grep instead of substr:

R> grep("vo", c("foo","bar","baz"))   # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>  
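The fractional max.dist values above can be read as follows, going by the agrep() documentation: a fraction is replaced by the smallest integer not less than that fraction of the pattern length (with the default unit costs). The sketch below only reproduces that arithmetic; it is not agrep()'s internal code, and max_edits is an illustrative name.

```r
pattern <- "vo"
fractions <- c(0.1, 0.25, 0.75)  # 0.1 is the documented default

# Assumed conversion rule: ceiling(fraction * pattern length)
max_edits <- ceiling(fractions * nchar(pattern))
print(max_edits)  # 1 1 2
```

So the default and 0.25 both allow one edit, which is enough for "foo" only, while 0.75 allows two edits. With two edits both characters of "vo" can be substituted, so every candidate of at least two characters matches.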
