How to change a quantifier in a Regex based on a condition?
I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
import re

string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
[u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
[u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier instead, it finds a but also returns area- instead of area.
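To make the two behaviours concrete, here is a minimal sketch that runs both variants of the pattern against the test string. The names plus_variant and star_variant are made up for this illustration, and the character class is written as [-'] inside a double-quoted raw string, which is equivalent to the [-\'] used above.

```python
import re

text = "a quake-prone area- (aujourd'hui-"

# Variant from the question: the trailing [a-z]+ forces a second letter.
plus_variant = re.compile(r"[a-z]+[-']?[a-z]+")
# Variant with *: single letters match, but a dangling hyphen is kept too.
star_variant = re.compile(r"[a-z]+[-']?[a-z]*")

plus_words = plus_variant.findall(text)
star_words = star_variant.findall(text)

print(plus_words)  # ['quake-prone', 'area', "aujourd'hui"]
print(star_words)  # ['a', 'quake-prone', 'area-', "aujourd'hui"]
```

Neither variant gives the desired list: the first drops a, the second keeps the trailing hyphen of area-.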
How could I then create a conditional regex saying: if the word contains an apostrophe or a hyphen, use the + quantifier, else use the * quantifier?
I suggest making the last [-\']?[a-z]+ part optional by putting it into a group and then adding a ? quantifier after that group. Inside the group the separator no longer needs its own ?, since the whole group is already optional.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
The reason a is not matched is that your regex contains two [a-z]+ parts, which together require at least two lowercase letters to be present in the match.
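As a quick check of that minimum-length claim, the sketch below tests a few candidate words against the question's pattern with re.fullmatch. The candidate strings and the variable names are chosen only for this illustration.

```python
import re

# Pattern from the question: two mandatory [a-z]+ parts.
original = re.compile(r"[a-z]+[-']?[a-z]+")

candidates = ["a", "ab", "a-b"]
# fullmatch succeeds only if the whole candidate fits the pattern.
results = {word: original.fullmatch(word) is not None for word in candidates}

print(results)  # {'a': False, 'ab': True, 'a-b': True}
```

A single letter can never satisfy both [a-z]+ parts at once, so one-letter words are always rejected.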
Note that the regex I mentioned won't match area- because the optional group (?:[-\'][a-z]+)? requires at least one lowercase letter immediately after the - symbol. If there is none, the group matches nothing and the match ends just before the hyphen. That is why you get area in the output instead of area-: no lowercase letter follows the -.
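To see where each match stops, here is a small sketch that uses finditer to collect the matched text together with its start and end offsets. The names fixed, text and matches are illustrative only.

```python
import re

text = "a quake-prone area- (aujourd'hui-"

# Optional group: a separator is consumed only when letters follow it.
fixed = re.compile(r"[a-z]+(?:[-'][a-z]+)?")

# Collect (matched text, start offset, end offset) for every match.
matches = [(m.group(), m.start(), m.end()) for m in fixed.finditer(text)]

for word, start, end in matches:
    print(word, start, end)

# The character right after the 'area' match is the hyphen that was left out.
area_end = matches[2][2]
print(repr(text[area_end]))  # '-'
```

The match for area ends exactly at the hyphen, which shows that the optional group was skipped rather than partially consumed.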