
How can I express 'repeat this part' in a regular expression?

Suppose I want to match a string like this:

123(432)123(342)2348(34)

I can match digits like 123 with [\d]* and (432) with \([\d]+\).

How can I match the whole string by repeating either of the two patterns?

I tried [[\d]* | \([\d]+\)]+, but this is incorrect.

I am using Python's re module.

I think you need this regex:

"^(\d+|\(\d+\))+$"

To avoid catastrophic backtracking (with \d+ inside a repeated group, the engine can split a run of digits in exponentially many ways when the match fails), change it to a regex like this:

"^(\d|\(\d+\))+$"

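As a quick check, here is a minimal sketch that runs both patterns against the sample string from the question and against two malformed inputs. The helper name `matches` and the malformed test strings are illustrative assumptions, not part of the original answer.

```python
import re

# The two patterns from the answer above.
NESTED = re.compile(r"^(\d+|\(\d+\))+$")  # \d+ inside a repeated group
SAFE = re.compile(r"^(\d|\(\d+\))+$")     # one digit per iteration

def matches(pattern, text):
    """Return True if the pattern matches the string from start to end."""
    return pattern.match(text) is not None

sample = "123(432)123(342)2348(34)"
print(matches(NESTED, sample), matches(SAFE, sample))  # both accept the sample
print(matches(SAFE, "123(432"))                        # unclosed parenthesis is rejected
```

Both patterns accept the same strings. The difference only shows up in how much work the engine does on long inputs that fail to match.
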
You can use a character class to match the whole string:

[\d()]+
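
One caveat worth checking: the character class only restricts which characters may appear, not how they are arranged. The sketch below (the second test string is a made-up example) suggests that it also accepts input with unbalanced parentheses.

```python
import re

LOOSE = re.compile(r"[\d()]+")

# The sample string from the question is accepted...
print(LOOSE.fullmatch("123(432)123(342)2348(34)") is not None)
# ...but so is a string whose parentheses are unbalanced.
print(LOOSE.fullmatch("12)(()3") is not None)
```

If the structure matters, one of the stricter patterns in the other answers is probably a better fit.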

But if you want to capture the separate parts in separate groups, you can use re.findall with a regex tailored to your needs, for example:

>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']

Or:

>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]

Or you can just use \d+ to get all the numbers:

>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']

If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:

(?:\d+\(\d+\))+
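
A short sketch of how this repeated pattern behaves with re.fullmatch and re.search. The second input string is an illustrative assumption chosen to show a partial match.

```python
import re

REPEATED = re.compile(r"(?:\d+\(\d+\))+")

s = "123(432)123(342)2348(34)"
# The whole sample consists of digits-then-parenthesised-digits units.
print(REPEATED.fullmatch(s) is not None)

# Trailing digits without a parenthesised part are not covered by the pattern,
# so search() only returns the leading run of complete units.
partial = REPEATED.search("123(432)99")
print(partial.group())
print(REPEATED.fullmatch("123(432)99") is not None)
```

Note that this pattern requires every number to be followed by a parenthesised number, which is stricter than "either of the two patterns" in the question.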

You can achieve it with this pattern:

^(?=.)\d*(?:\(\d+\)\d*)*$


(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).

\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern.

Explanation: With a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B cannot match the beginning of B or A, respectively), you can rewrite the sub-pattern as A*(BA*)* or B*(AB*)*. In your example, it replaces (?:\d+|\(\d+\))*.

This new form is more efficient: it reduces the number of steps needed to obtain a match and avoids much of the potential backtracking.
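
To make the rewrite concrete, here is a minimal sketch that checks whether the alternation form and the unrolled form agree on a handful of inputs. The helper name `agree` and the test strings other than the sample are assumptions made for illustration, and the check says nothing about speed.

```python
import re

# Alternation form (A|B)* and its unrolled equivalent A*(BA*)*,
# with A = \d and B = \(\d+\), both wrapped in the same anchors.
ALTERNATION = re.compile(r"^(?=.)(?:\d+|\(\d+\))*$")
UNROLLED = re.compile(r"^(?=.)\d*(?:\(\d+\)\d*)*$")

def agree(text):
    """Return the shared verdict, or raise if the two patterns disagree."""
    a = ALTERNATION.match(text) is not None
    u = UNROLLED.match(text) is not None
    if a != u:
        raise ValueError(f"patterns disagree on {text!r}")
    return u

cases = ["123(432)123(342)2348(34)", "123", "(1)(2)", "", "12(", "()", "1(2)3)"]
for text in cases:
    print(repr(text), agree(text))
```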

Note that you can improve it further by emulating an atomic group (?>....) with the trick (?=(....))\1, which relies on the fact that a lookahead is naturally atomic:

^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$

demo (compare the number of steps needed with the previous version and check the debugger to see what happens)
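
A small sketch comparing the plain unrolled pattern with the lookahead-based emulation in Python's re module. The test strings other than the sample are illustrative assumptions. The comparison only confirms that both forms accept the same inputs and does not measure the number of steps.

```python
import re

PLAIN = re.compile(r"^(?=.)\d*(?:\(\d+\)\d*)*$")
# (?=(...))\1 captures the longest run inside a lookahead, then consumes it
# through the backreference, so the engine cannot backtrack into it.
ATOMIC = re.compile(r"^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$")

for text in ["123(432)123(342)2348(34)", "123(432", "", "7"]:
    plain_ok = PLAIN.match(text) is not None
    atomic_ok = ATOMIC.match(text) is not None
    print(repr(text), plain_ok, atomic_ok)
```

On Python 3.11 and later, the re module also accepts the (?>...) atomic group syntax directly, so the emulation is mainly useful on older versions.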

Note: if you don't want to allow two consecutive numbers enclosed in parentheses, you only need to change the quantifier * to + on the \d inside the non-capturing group and add (?:\(\d+\))? at the end of the pattern, before the anchor $:

^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$

or

^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
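
A brief sketch of the stricter variant in use. The test strings other than the sample are made-up examples meant to show that adjacent parenthesised numbers are rejected while a trailing one is still allowed.

```python
import re

NO_ADJACENT = re.compile(r"^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$")

for text in ["123(432)123(342)2348(34)", "1(2)3(4)", "(5)", "(1)(2)"]:
    # Only the last string, with two parenthesised numbers in a row, should fail.
    print(repr(text), NO_ADJACENT.match(text) is not None)
```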
