
Match a whole word in a string using dynamic regex

I want to check whether a word occurs in a sentence using a regex. Words are separated by spaces, but may have punctuation on either side. If the word is in the middle of the string, the following pattern works (it prevents partial-word matches and allows punctuation on either side of the word):

match_middle_words = " [^a-zA-Z\d ]{0,}" + word + "[^a-zA-Z\d ]{0,} "

However, this won't match the first or last word, since there is no leading/trailing space. So, for these cases, I have also been using:

match_starting_word = "^[^a-zA-Z\d]{0,}" + word + "[^a-zA-Z\d ]{0,} "
match_end_word = " [^a-zA-Z\d ]{0,}" + word + "[^a-zA-Z\d]{0,}$"

and then combining them with:

match_string = match_middle_words + "|" + match_starting_word + "|" + match_end_word

Is there a simple way to avoid needing three match terms? Specifically, is there a way of specifying 'either a space or the start of the file' (i.e. "^") and, similarly, 'either a space or the end of the file' (i.e. "$")?

Why not use a word boundary?

match_string = r'\b' + word + r'\b'
match_string = r'\b{}\b'.format(word)
match_string = rf'\b{word}\b'          # Python 3.6+ required (f-string)

If you have a list of words (say, in a words variable), each of which should be matched as a whole word, use

match_string = r'\b(?:{})\b'.format('|'.join(words))
match_string = rf'\b(?:{"|".join(words)})\b'         # Python 3.6+ required (f-string)

This way, the word is matched only when it is not adjacent to other word characters (letters, digits, or underscores). Also note that \b matches at the start and end of the string when a word character sits there, so there is no need for three alternatives.

Sample code:

import re
strn = "word hereword word, there word"
search = "word"
print(re.findall(r"\b" + search + r"\b", strn))

And we find our 3 matches:

['word', 'word', 'word']
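
To tie this back to the original question (first word, last word, adjacent punctuation), here is a minimal sketch of a reusable check. The helper name contains_word is hypothetical and not part of the re module; re.escape is added as a precaution in case the word contains regex metacharacters.

```python
import re

def contains_word(word, sentence):
    """Return True if `word` occurs in `sentence` as a whole word.

    `contains_word` is an illustrative helper name, not a library function.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence) is not None

print(contains_word("word", "word, at the start"))  # True: first word, trailing comma
print(contains_word("word", "at the end: word"))    # True: last word
print(contains_word("word", "only a hereword"))     # False: part of a longer word
```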

NOTE ON "WORD" BOUNDARIES

When the "words" are in fact arbitrary chunks of characters, you should re.escape them before inserting them into the regex pattern:

match_string = r'\b{}\b'.format(re.escape(word)) # a single escaped "word" string passed
match_string = r'\b(?:{})\b'.format("|".join(map(re.escape, words))) # words list is escaped
match_string = rf'\b(?:{"|".join(map(re.escape, words))})\b' # Same as above for Python 3.6+
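
A small illustration of why escaping matters, using a made-up search term that contains a dot. Without re.escape, the dot is treated as "any character" and the pattern also matches text that was never intended:

```python
import re

text = "version 1x5 and version 1.5"
word = "1.5"  # illustrative search term containing a regex metacharacter

unescaped = r"\b{}\b".format(word)
escaped = r"\b{}\b".format(re.escape(word))

print(re.findall(unescaped, text))  # ['1x5', '1.5'] -- the dot matched 'x' too
print(re.findall(escaped, text))    # ['1.5']
```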

If the words to be matched as whole words may start or end with special (non-word) characters, \b won't work; use unambiguous word boundaries instead:

match_string = r'(?<!\w){}(?!\w)'.format(re.escape(word))
match_string = r'(?<!\w)(?:{})(?!\w)'.format("|".join(map(re.escape, words))) 
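
As a rough demonstration of the difference, consider a search term that ends in non-word characters (the term "c++" and the sample text are made up for illustration). \b requires a word character on one side of the position, so it fails between "+" and a space, while the lookaround version only requires that no word character is adjacent:

```python
import re

text = "I use c++ and c daily"
word = "c++"  # illustrative term ending in non-word characters

with_b = r"\b{}\b".format(re.escape(word))
with_lookarounds = r"(?<!\w){}(?!\w)".format(re.escape(word))

print(re.findall(with_b, text))            # [] -- no \b between '+' and ' '
print(re.findall(with_lookarounds, text))  # ['c++']
```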

If the word boundaries are whitespace characters or the start/end of the string, use whitespace boundaries, (?<!\S)...(?!\S):

match_string = r'(?<!\S){}(?!\S)'.format(re.escape(word))
match_string = r'(?<!\S)(?:{})(?!\S)'.format("|".join(map(re.escape, words))) 
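
A quick sketch of how whitespace boundaries behave, on a made-up sample string. Note that they are stricter than \b: a word touching punctuation is not matched, so this variant only suits input where words are delimited purely by whitespace.

```python
import re

text = "word (word) word, word"
pattern = r"(?<!\S){}(?!\S)".format(re.escape("word"))

# Only the occurrences with whitespace (or string start/end) on both sides match.
spans = [m.span() for m in re.finditer(pattern, text)]
print(spans)  # [(0, 4), (18, 22)]
```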
