
How to change a quantifier in a Regex based on a condition?

I would like to find words of length >= 1 which may contain a ' or a - inside them. Here is a test string:

a quake-prone area- (aujourd'hui-

In Python, I'm currently using this regex:

import re

string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)

I would like to get this result:

>>> words
[u'a', u'quake-prone', u'area', u"aujourd'hui"]

but I get this instead:

>>> words
[u'quake-prone', u'area', u"aujourd'hui"]

Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier instead, it finds a, but it also matches area- instead of area.
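To make the problem concrete, here is a quick sketch of the * variant described above. The name RE_WORDS_STAR is only for illustration and is not from the original post; the pattern is the question's regex with the last + swapped for *.

```python
import re

string = "a quake-prone area- (aujourd'hui-"

# Illustrative variant: the last quantifier is relaxed from + to *
RE_WORDS_STAR = re.compile(r"[a-z]+[-']?[a-z]*")
words_star = RE_WORDS_STAR.findall(string)

# 'a' is found now, but 'area-' keeps its trailing hyphen
print(words_star)
```

The single-letter word is picked up, but because the separator and the letters after it are optional independently of each other, a dangling hyphen is swallowed too.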

So how could I create a conditional regex that says: if the word contains an apostrophe or a hyphen, use the + quantifier, else use the * quantifier?

I suggest making the last part, [-\']?[a-z]+, optional by putting it into a non-capturing group and adding a ? quantifier after that group. Inside the group the separator itself becomes mandatory:

>>> import re
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]

The reason a is not matched by your regex is that it contains two [a-z]+ parts, which together require at least two lowercase letters in every match.
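A small check of that claim, assuming Python 3.4+ for re.fullmatch. The names RE_ORIGINAL, one_letter and two_letters are made up for this sketch; the pattern is the one from the question.

```python
import re

# Original pattern from the question: two mandatory [a-z]+ parts
RE_ORIGINAL = re.compile(r"[a-z]+[-']?[a-z]+")

# Only one letter is available, but each + part needs at least one
one_letter = RE_ORIGINAL.fullmatch("a")

# With two letters, backtracking lets each + part take exactly one
two_letters = RE_ORIGINAL.fullmatch("ab")

print(one_letter is None, two_letters is not None)
```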

Note that the regex I suggested won't match area-, because the optional group (?:[-\'][a-z]+)? requires at least one lowercase letter immediately after the - symbol. In area- no letter follows the hyphen, so the group matches nothing and the match ends just before the hyphen. That is why you get area in the output instead of area-.
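If you want to see where each match stops, something like the following should work. It is just a sketch using re.finditer with the same pattern; the variable spans is a name introduced here for illustration.

```python
import re

string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r"[a-z]+(?:[-'][a-z]+)?")

# Collect (matched text, start, end) for every word found
spans = [(m.group(), m.start(), m.end()) for m in RE_WORDS.finditer(string)]

for text, start, end in spans:
    # string[end:end + 1] is the character right after the match
    # (an empty string if the match ends at the end of the input)
    print(repr(text), start, end, repr(string[end:end + 1]))
```

For area, the character right after the match is the hyphen that the optional group declined to consume.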
