Using \b in a regex, trying not to match words that start with $

I'm having trouble getting the desired output using a negative lookahead.

import re
text = "$FOO FOO $BAR BAR"

# Expected. Return words that do not start with 'F'.
re.findall(r"\b(?!F)\w+", text)
# > ['BAR', 'BAR']

# Expected. Return words that do not start with 'B'.
re.findall(r"\b(?!B)\w+", text)
# > ['FOO', 'FOO']

# Unexpected. Should return words that do not start with '$'.
re.findall(r"\b(?!\$)\w+", text)
# > ['FOO', 'FOO', 'BAR', 'BAR']

The first two work as expected. I expect the last one to return the list ['FOO', 'BAR'], matching only the words that are not prefixed with the "$" character. Because "$" is a special character, I've tried various ways to escape it but haven't found the right solution.

You actually need to fix the pattern in the following way:

\b(?<!\$)\w+

See the Python demo.

The reason is that \b(?!\$)\w+ is equivalent to \b\w+: $ can never be matched by \w, so there is no need to restrict the first character matched by \w with the (?!\$) negative lookahead. You need to restrict the character that comes immediately before the first character matched by \w, and that is done with a negative lookbehind, here (?<!\$).

import re
text = "$FOO FOO $BAR BAR"
print(re.findall(r"\b(?<!\$)\w+", text))
# > ['FOO', 'BAR']
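
To make the reasoning above concrete, here is a small sketch (the variable names are only illustrative, not from the original posts). It checks that the lookahead version and the plain \b\w+ version return the same matches, then prints the character immediately before each match, which is the position the lookbehind actually inspects.

```python
import re

text = "$FOO FOO $BAR BAR"

# (?!\$) looks at the same character that \w must match next.
# \w never matches "$", so the lookahead can never reject a match.
with_lookahead = re.findall(r"\b(?!\$)\w+", text)
without_lookahead = re.findall(r"\b\w+", text)
print(with_lookahead == without_lookahead)  # True

# Record the character just before each word; this is what (?<!\$) tests.
preceding = []
for m in re.finditer(r"\b\w+", text):
    before = text[m.start() - 1] if m.start() > 0 else ""
    preceding.append((m.group(), before))
print(preceding)
# [('FOO', '$'), ('FOO', ' '), ('BAR', '$'), ('BAR', ' ')]

# Only the words whose preceding character is not "$" survive the lookbehind.
with_lookbehind = re.findall(r"\b(?<!\$)\w+", text)
print(with_lookbehind)  # ['FOO', 'BAR']
```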

Now, as you say (?<=^)(?!\$)\w+|(?<=\s)(?!\$)\w+ works for you, you can see that you may safely remove the lookaheads from the regex, since they do not do anything meaningful, and the regex becomes (?<=^)\w+|(?<=\s)\w+. This expression can be shrunk further into a slim (?<!\S)\w+ pattern that matches one or more word characters that are immediately preceded by the start of the string or a whitespace character.
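
As a quick check of the simplification described above, the sketch below runs the three forms of the pattern against the same text. The dictionary keys and the second sample string are made up for illustration. The last part shows that (?<!\S)\w+ is stricter than \b(?<!\$)\w+: it requires whitespace or the start of the string before the word, so words that follow other punctuation are skipped as well.

```python
import re

text = "FOO $FOO FOO $BAR BAR"

patterns = {
    "with lookaheads": r"(?<=^)(?!\$)\w+|(?<=\s)(?!\$)\w+",
    "lookaheads removed": r"(?<=^)\w+|(?<=\s)\w+",
    "single lookbehind": r"(?<!\S)\w+",
}

# All three patterns should find the same words.
results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
# each line ends with ['FOO', 'FOO', 'BAR']

# Hypothetical sample with punctuation other than "$" before a word.
sample = "#FOO x-BAR BAZ"
only_dollar_excluded = re.findall(r"\b(?<!\$)\w+", sample)
whitespace_required = re.findall(r"(?<!\S)\w+", sample)
print(only_dollar_excluded)  # ['FOO', 'x', 'BAR', 'BAZ']
print(whitespace_required)   # ['x', 'BAZ']
```

Which of the two to use depends on whether words glued to other punctuation should count as matches.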

Thanks to Charles for putting me on the right track. I had an incorrect understanding of how word boundaries function.

import re
text = "FOO $FOO FOO $BAR BAR"

# Raw string so that \$, \w and \s reach the regex engine unchanged.
re.findall(r'(?<=^)(?!\$)\w+|(?<=\s)(?!\$)\w+', text)
# > ['FOO', 'FOO', 'BAR']

Replacing \b with a positive lookbehind that matches whitespace or the beginning of the string gives the desired output.
