Fuzzy string matching and regular expressions

Question

I have a vector of sentences such as:

example <- c("text text word1 text text word2 text text", ...)

and I'm trying to identify which sentences comply with the following rules:

the sentence contains both "word1" and "word2"
"word1" comes before "word2"
there are between zero and three words between "word1" and "word2"

This could be done with a normal regex. However, the problem is that "word1" or "word2" can contain typos (I am expecting a combined edit distance of at most 3 for both words together). Examples of typos could be "wrod1", "woord2", "wrd1", etc. I also want to match the sentences that contain typos for these words within the distance constraint. Therefore I was trying to use agrepl:

agrepl("(?:.*?)\\bword1\\b(?:\\s(?:\\w+\\s){0,3})\\bword2\\b(?:.*?)", example, fixed=FALSE, max=3)

However, I believe that the distance is being calculated against the whole sentence and not only against "word1" and "word2", so I will almost never get any matches this way. Any suggestions on how to fix this, or is agrepl/regex not the best tool for this problem?

Answer 1

This fits your rules; however, I'm not sure what your typos would look like. If you could show some examples, that would be great.

^(?=.*word1\\s+(?:\\S+\\s+){0,3}word2.*$).*
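As a rough sketch (not part of the original answer), the pattern above can be run in R with grepl and perl = TRUE, since the lookahead is PCRE syntax. It only handles the exact spelling, so the sketch also includes one possible way to cover the typo requirement: split each sentence into tokens and compare the tokens to the two targets with utils::adist, so that the edit distance is measured per word rather than against the whole sentence. The helper name fuzzy_pair, its arguments, the sample sentences, and the "closer to one target than the other" tie-break are all assumptions made for illustration.

```r
# Sample sentences (made up for this sketch)
example <- c(
  "text text word1 text text word2 text text",  # two words in between
  "text word1 a b c d word2 text",              # four words in between
  "text word2 text word1 text",                 # wrong order
  "text wrod1 text woord2 text"                 # typos, combined distance 3
)

# Exact matching: the lookahead pattern needs perl = TRUE in R
pattern <- "^(?=.*word1\\s+(?:\\S+\\s+){0,3}word2.*$).*"
exact <- grepl(pattern, example, perl = TRUE)

# Fuzzy matching (hypothetical helper): the distance budget is spent on the
# two target words only, never on the rest of the sentence.
fuzzy_pair <- function(sentence, w1 = "word1", w2 = "word2",
                       max_gap = 3, max_dist = 3) {
  tokens <- strsplit(trimws(sentence), "\\s+")[[1]]
  n <- length(tokens)
  if (n < 2) return(FALSE)
  d1 <- as.vector(utils::adist(tokens, w1))  # Levenshtein distance to w1
  d2 <- as.vector(utils::adist(tokens, w2))  # Levenshtein distance to w2
  # Assumption: a token counts as a candidate for w1 only if it is strictly
  # closer to w1 than to w2 (and vice versa). This matters here because
  # "word1" and "word2" are themselves only one edit apart.
  is_w1 <- d1 < d2
  is_w2 <- d2 < d1
  for (i in which(is_w1)) {
    if (i == n) next
    j <- (i + 1):min(n, i + 1 + max_gap)     # at most max_gap tokens between
    if (any(is_w2[j] & (d1[i] + d2[j] <= max_dist))) return(TRUE)
  }
  FALSE
}

fuzzy <- vapply(example, fuzzy_pair, logical(1), USE.NAMES = FALSE)

print(exact)  # TRUE FALSE FALSE FALSE
print(fuzzy)  # TRUE FALSE FALSE TRUE
```

Only the last sentence differs between the two checks: the exact regex misses it, while the token-based check accepts it because "wrod1" is 2 edits from "word1" and "woord2" is 1 edit from "word2". Note that adist counts a transposition as two edits, so whether a budget of 3 is right depends on which typos are expected.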

Answer 1
2 2016-02-17 11:26:46
