Fuzzy string matching and regex

I have a vector of sentences such as:

example <- c("text text word1 text text word2 text text", ...)

and I'm trying to identify which sentences comply with the following rules:

  • the sentence contains both "word1" and "word2"
  • "word1" comes before "word2"
  • there are between zero and three words between "word1" and "word2"

This could be done with a normal regex. However, the problem is that "word1" or "word2" can contain typos (I am expecting an edit distance of at most 3 for both words together). Examples of typos could be "wrod1", "woord2", "wrd1", etc. I also want to match the sentences that contain typos for these words within the distance constraint. Therefore I was trying to use agrepl:

agrepl("(?:.*?)\\bword1\\b(?:\\s(?:\\w+\\s){0,3})\\bword2\\b(?:.*?)", example, fixed=FALSE, max=3)

However, I believe that the distance is being calculated over the whole sentence and not only over "word1" and "word2", and therefore I will almost never get any matches this way. Any suggestions on how to fix this, or is agrepl/regex not the best tool for this problem?
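One possible way to restrict the distance to the two target words is to split each sentence into tokens and score the tokens individually with base R's `adist()`. The sketch below is only an illustration under stated assumptions: the function name `fuzzy_pair_match` and its parameters `max_gap` and `max_dist` are hypothetical, the test sentences after the first are invented, and "distance" is taken to mean plain Levenshtein distance summed over both words.

```r
# Sketch only: fuzzy_pair_match, max_gap and max_dist are hypothetical names.
# Idea: tokenise, score every token against both targets with adist(),
# then look for a (word1-like, word2-like) pair that is close enough.
fuzzy_pair_match <- function(sentence, w1 = "word1", w2 = "word2",
                             max_gap = 3, max_dist = 3) {
  tokens <- strsplit(trimws(sentence), "\\s+")[[1]]
  if (length(tokens) < 2) return(FALSE)

  d1 <- drop(adist(tokens, w1))  # edit distance of every token to w1
  d2 <- drop(adist(tokens, w2))  # edit distance of every token to w2

  for (i in seq_len(length(tokens) - 1)) {
    if (d1[i] > max_dist) next
    # Candidate positions for w2: after i, with at most max_gap words between.
    j <- seq(i + 1, min(i + 1 + max_gap, length(tokens)))
    # The combined distance of both words must stay within the budget.
    if (any(d1[i] + d2[j] <= max_dist)) return(TRUE)
  }
  FALSE
}

# Hypothetical sentences, using the typos mentioned in the question.
example <- c(
  "text text word1 text text word2 text text",  # exact match
  "text wrod1 text woord2 text",                # typos in both words
  "text wrd1 woord2 text",                      # typos, no words in between
  "word1 a b c d word2",                        # four words in between
  "text text text"                              # neither word present
)
fuzzy_hits <- vapply(example, fuzzy_pair_match, logical(1), USE.NAMES = FALSE)
print(fuzzy_hits)
```

Two caveats apply. `adist()` counts a transposition such as "wrod1" as two edits, not one. The placeholders "word1" and "word2" are themselves only one edit apart, so with a budget of 3 a sentence with the words in the wrong order would also pass; real target words would presumably be more distinct.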

This fits your rules; however, I'm not sure what your typos would look like. If you could show some examples, that would be great.

^(?=.*word1\\s+(?:\\S+\\s+){0,3}word2.*$).*
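As a rough illustration of how this pattern could be applied in R: the lookahead `(?=...)` is Perl syntax, so the sketch below assumes `grepl()` is called with `perl = TRUE`. The test sentences are hypothetical and not from the original question.

```r
# The pattern from the answer; the lookahead (?=...) needs perl = TRUE in R.
pattern <- "^(?=.*word1\\s+(?:\\S+\\s+){0,3}word2.*$).*"

# Hypothetical test sentences, not taken from the question.
sentences <- c(
  "text text word1 text text word2 text text",  # two words in between
  "text word1 word2 text",                      # zero words in between
  "word1 a b c d word2",                        # four words in between
  "text word2 text word1 text",                 # wrong order
  "text wrod1 text word2 text"                  # typo in word1
)

exact_hits <- grepl(pattern, sentences, perl = TRUE)
print(exact_hits)
```

The pattern checks the three rules exactly, so the sentence with the typo is not matched. It also has no word boundaries around "word1" and "word2", so a token such as "myword1" would count as a match.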
