Fuzzy string matching and regex
I have a vector of sentences such as:
example <- c("text text word1 text text word2 text text", ...)
and I'm trying to identify which sentences comply with the following rules:
This could be done with a normal regex. However, the problem is that "word1" or "word2" can contain typos (I expect a combined edit distance of at most 3 for the two words together). Examples of typos are "wrod1", "woord2", "wrd1", etc. I also want to match sentences that contain such typos, as long as they stay within the distance constraint. Therefore I was trying to use agrepl:
agrepl("(?:.*?)\\bword1\\b(?:\\s(?:\\w+\\s){0,3})\\bword2\\b(?:.*?)", example, fixed=FALSE, max=3)
However, I believe that the distance is being calculated over the whole sentence and not only over "word1" and "word2", so I will almost never get any matches this way. Any suggestions on how to fix this, or is agrepl/regex not the best tool for this problem?
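To make the intent concrete, here is a minimal sketch of one possible token-based alternative, not a confirmed fix. It assumes sentences can be split on whitespace, that "at most 3 words in between" is the proximity rule (read off the `{0,3}` in the regex above), and that the combined budget is the sum of the two per-word distances from `adist()` (base R's generalized Levenshtein distance). The names `fuzzy_pair_match`, `max_gap`, `max_dist`, and `sentences` are hypothetical and only for illustration.

```r
# Per-word edit distances for the typos mentioned above.
# adist() counts a transposition as two edits.
adist("word1", c("wrod1", "wrd1"))  # 2 and 1
adist("word2", "woord2")            # 1

# Hypothetical helper: TRUE if the sentence contains an approximate w1
# followed, with at most `max_gap` words in between, by an approximate w2,
# where the two edit distances together do not exceed `max_dist`.
fuzzy_pair_match <- function(sentence, w1 = "word1", w2 = "word2",
                             max_gap = 3, max_dist = 3) {
  tokens <- strsplit(sentence, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < 2) return(FALSE)

  d1 <- as.vector(adist(tokens, w1))  # distance of every token to w1
  d2 <- as.vector(adist(tokens, w2))  # distance of every token to w2

  for (i in seq_len(length(tokens) - 1)) {
    # Candidate positions for w2: after i, at most max_gap words in between.
    j <- (i + 1):min(i + 1 + max_gap, length(tokens))
    if (any(d1[i] + d2[j] <= max_dist)) return(TRUE)
  }
  FALSE
}

# Made-up sentences, for illustration only.
sentences <- c(
  "text text word1 text text word2 text text",  # exact words, 2 words between
  "text wrod1 text woord2 text",                # typos, combined distance 3
  "text word1 a b c d word2 text",              # 4 words between: too far
  "text text text text"                         # neither word present
)
vapply(sentences, fuzzy_pair_match, logical(1), USE.NAMES = FALSE)
# TRUE TRUE FALSE FALSE
```

Because the distance is computed per token rather than per sentence, the surrounding text no longer uses up the budget. Punctuation attached to a word would count toward its distance here, so real input may need cleaning first.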