Match a whole word in a string using a dynamic regex

Question

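I want to check whether a word appears in a sentence using a regular expression. Words are separated by spaces, but may have punctuation on either side. If the word is in the middle of the string, the following match works (it prevents partial-word matches and allows punctuation on either side of the word).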

match_middle_words = " [^a-zA-Z\d ]{0,}" + word + "[^a-zA-Z\d ]{0,} "

However, this does not match the first or last word, because there is no trailing/leading space. So, for those cases, I have also been using:

match_starting_word = "^[^a-zA-Z\d]{0,}" + word + "[^a-zA-Z\d ]{0,} "
match_end_word = " [^a-zA-Z\d ]{0,}" + word + "[^a-zA-Z\d]{0,}$"

and then combining them:

 match_string = match_middle_words  + "|" + match_starting_word  +"|" + match_end_word

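Is there a simple way to avoid needing three patterns? Specifically, is there a way to specify "either a space or the start of the file (i.e. "^")" and, similarly, "either a space or the end of the file (i.e. "$")"?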

Answer 1

Why not use word boundaries?

match_string = r'\b' + word + r'\b'
match_string = r'\b{}\b'.format(word)
match_string = rf'\b{word}\b'          # Python 3.6+ required (f-strings)

If you have a list of words (say, in a words variable) that should each be matched as a whole word, use

match_string = r'\b(?:{})\b'.format('|'.join(words))
match_string = rf'\b(?:{"|".join(words)})\b'         # Python 3.6+ required (f-strings)

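In this case, a word is matched only when it is surrounded by non-word characters. Note also that \b matches at the start and end of the string, so the three alternatives are unnecessary.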

Sample code:

import re
strn = "word hereword word, there word"
search = "word"
print(re.findall(r"\b" + search + r"\b", strn))

This finds 3 matches:

['word', 'word', 'word']
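
The word-list variant can be checked the same way. This is a minimal sketch; the words list and the sample text are made up for illustration and are not from the original question.

```python
import re

# Hypothetical sample data, chosen only for illustration.
words = ["cat", "dog"]
text = "A cat, a dog and a catalog of dogs."

# The non-capturing group makes both \b anchors apply to the whole alternation.
pattern = re.compile(r"\b(?:{})\b".format("|".join(words)))

matches = pattern.findall(text)
print(matches)  # ['cat', 'dog']
```

"catalog" and "dogs" are skipped because the character next to the candidate match is a word character, so no boundary exists there.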

A note on "word" boundaries

When the "words" are actually arbitrary chunks of characters, you should re.escape them before passing them to the regex pattern:

match_string = r'\b{}\b'.format(re.escape(word)) # a single escaped "word" string passed
match_string = r'\b(?:{})\b'.format("|".join(map(re.escape, words))) # words list is escaped
match_string = rf'\b(?:{"|".join(map(re.escape, words))})\b' # Same as above for Python 3.6+
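
To see why escaping matters, here is a small sketch comparing the escaped and unescaped pattern. The word "a.b" and the sample text are assumptions picked to show the effect of a regex metacharacter.

```python
import re

word = "a.b"        # hypothetical "word" containing the metacharacter "."
text = "axb a.b"    # hypothetical sample text

# Unescaped: "." matches any character, so "axb" is matched as well.
unescaped = re.findall(r"\b{}\b".format(word), text)

# Escaped: "." is treated literally, so only "a.b" is matched.
escaped = re.findall(r"\b{}\b".format(re.escape(word)), text)

print(unescaped)  # ['axb', 'a.b']
print(escaped)    # ['a.b']
```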

If the words to be matched as whole words may start or end with special characters, \b will not work; use unambiguous word boundaries instead:

match_string = r'(?<!\w){}(?!\w)'.format(re.escape(word))
match_string = r'(?<!\w)(?:{})(?!\w)'.format("|".join(map(re.escape, words)))
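
The difference shows up with a word that ends in a non-word character. The following sketch uses "C++" and a made-up sentence as assumed inputs.

```python
import re

word = "C++"                 # hypothetical word ending in non-word characters
text = "I like C++ a lot"    # hypothetical sample text
escaped = re.escape(word)

# \b needs a word character on exactly one side of the position.
# Between "+" and " " both sides are non-word characters, so \b fails.
with_b = re.findall(r"\b{}\b".format(escaped), text)

# The lookarounds only require that no word character is adjacent.
with_lookarounds = re.findall(r"(?<!\w){}(?!\w)".format(escaped), text)

print(with_b)            # []
print(with_lookarounds)  # ['C++']
```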

If the word boundaries are whitespace characters or the start/end of the string, use whitespace boundaries, (?<!\S)...(?!\S):

match_string = r'(?<!\S){}(?!\S)'.format(re.escape(word))
match_string = r'(?<!\S)(?:{})(?!\S)'.format("|".join(map(re.escape, words)))
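
As a rough comparison of the two lookaround styles, the sketch below runs both against the same made-up text. Note that whitespace boundaries are stricter: a word with adjacent punctuation is not matched.

```python
import re

word = "word"                 # hypothetical search term
text = "word, (word) word"    # hypothetical sample text
escaped = re.escape(word)

# Word-character boundaries: adjacent punctuation is allowed.
word_bounded = re.findall(r"(?<!\w){}(?!\w)".format(escaped), text)

# Whitespace boundaries: only whitespace or the string start/end may be adjacent.
space_bounded = re.findall(r"(?<!\S){}(?!\S)".format(escaped), text)

print(word_bounded)   # ['word', 'word', 'word']
print(space_bounded)  # ['word']
```

Since the question allows punctuation on either side of the word, the (?<!\w)...(?!\w) form is probably the closer fit there; the whitespace form suits tokens that must stand entirely alone.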

Solution 1
14, accepted, 2015-05-01 22:30:20