
Regex for matching exact words that contain apostrophes in Python?

For the purpose of this project, I'm using exact regex patterns rather than more general ones. I'm counting occurrences of words from a list stored in a text file, which I import into my script as vocabWords. Each word in the list is in the format \bword\b.

When I run my script, \bwhat\b picks up both "what" and "what's", but \bwhat's\b picks up nothing. If I switch the order so that the apostrophe word comes before the root word, the words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem comes from using "\b", but I haven't been able to find out how to fix it. I cannot use a more general regex, and I have to include the words themselves in the regex pattern.

vocabWords:

\bwhat\b
\bwhat's\b
\biron\b
\biron's\b

My code:

import re

matched = []
# Join every vocabulary pattern into a single alternation.
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:
    matched.append(regex_all.findall(row))

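The regex engine tries the alternatives from left to right and stops at the first one that matches. Because an apostrophe is not a word character, \b matches between "what" and "'s", so \bwhat\b wins before \bwhat's\b is ever tried. If you sort your word list by length (longest first) before turning it into a regex, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.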

regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
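To make the effect of the ordering visible, here is a minimal sketch that compares the unsorted and the sorted alternation. The sample sentence and the names `vocab_words`, `unsorted_hits`, and `sorted_hits` are made up for illustration; the original script reads its patterns from a file and its rows from a DataFrame.

```python
import re

# Hypothetical stand-ins for the question's vocabWords list and one text row.
vocab_words = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]
text = "what's iron and what is iron's weight"

# Original order: \bwhat\b is tried first and also wins on "what's".
unsorted_re = re.compile("|".join(vocab_words))
unsorted_hits = unsorted_re.findall(text)

# Longest pattern first: \bwhat's\b is tried before \bwhat\b.
sorted_re = re.compile("|".join(sorted(vocab_words, key=len, reverse=True)))
sorted_hits = sorted_re.findall(text)

print(unsorted_hits)  # ['what', 'iron', 'what', 'iron']
print(sorted_hits)    # ["what's", 'iron', 'what', "iron's"]
```

Sorting by length works here because each apostrophe form is strictly longer than its root word, so it always ends up earlier in the alternation.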

There are at least two other solutions:

  1. Test that the next character isn't an apostrophe: r"\bwhat(?!')\b"
  2. Use a more general rule, r"\bwhat(?:'s)?\b", to catch both variants, with and without the apostrophe.
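Both alternatives can be tried out with a short sketch. The sample sentence and the helper `guard` are hypothetical and not part of the original answer; `guard` simply applies the lookahead from option 1 to every word, so the list no longer depends on its order.

```python
import re

text = "what's iron and what is iron's weight"  # made-up sample row

# Option 1: the root word matches only when no apostrophe follows it.
root_only = re.compile(r"\bwhat(?!')\b")
root_hits = root_only.findall(text)

# Option 2: one pattern covers both "what" and "what's".
both_forms = re.compile(r"\bwhat(?:'s)?\b")
both_hits = both_forms.findall(text)

def guard(word):
    """Hypothetical helper: wrap a plain word in the option 1 pattern."""
    return rf"\b{re.escape(word)}(?!')\b"

# With the lookahead in place, the original list order gives correct counts.
words = ["what", "what's", "iron", "iron's"]
guarded_re = re.compile("|".join(guard(w) for w in words))
guarded_hits = guarded_re.findall(text)

print(root_hits)     # ['what']
print(both_hits)     # ["what's", 'what']
print(guarded_hits)  # ["what's", 'iron', 'what', "iron's"]
```

Option 1 keeps one exact pattern per word, which fits the constraint in the question. Option 2 merges the two forms into one pattern, so it only helps if counting them together is acceptable.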

