How can I express 'repeat this part' in a regular expression?
Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match a run of digits like 123 with [\d]* and a parenthesized group like (432) with \([\d]+\).
How can I match the whole string by repeating either of the two patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using the Python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
The nested quantifiers in (\d+)+ let the engine split a run of digits in many different ways when the match fails. To avoid this catastrophic backtracking, change it to a regex that matches one digit per iteration:
"^(\d|\(\d+\))+$"
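As a quick check of the second pattern, here is a minimal sketch that runs it against the string from the question. The names PATTERN and matches_whole and the two malformed inputs are illustrative, not part of the original answer. It also hints at why the attempt in the question fails: square brackets define a character class rather than a group, so parentheses are needed around the alternation.

```python
import re

# One digit, or one parenthesized run of digits, repeated one or more times.
# ^ and $ anchor the match to the whole string.
PATTERN = re.compile(r"^(\d|\(\d+\))+$")

def matches_whole(text):
    """Return True if text is made only of digits and (digits) groups."""
    return PATTERN.match(text) is not None

print(matches_whole("123(432)123(342)2348(34)"))  # True
print(matches_whole("123(432"))                   # False: unclosed parenthesis
print(matches_whole("12()34"))                    # False: empty parentheses
```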
You can use a character class to match the whole string:
[\d()]+
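A short sketch of what this character class accepts, assuming re.fullmatch is used so that the whole string has to match. The class only restricts which characters may appear, so it also accepts malformed input such as unbalanced parentheses. The helper name and the extra test strings are made up for illustration.

```python
import re

CHAR_CLASS = re.compile(r"[\d()]+")

def only_digits_and_parens(text):
    """True if every character is a digit or a parenthesis (structure is not checked)."""
    return CHAR_CLASS.fullmatch(text) is not None

print(only_digits_and_parens("123(432)123(342)2348(34)"))  # True
print(only_digits_and_parens("((12"))                      # True: balance is not checked
print(only_digits_and_parens("12a(3)"))                    # False: 'a' is not in the class
```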
But if you want to capture the separate parts in separate groups, you can use re.findall with a special regex based on your needs, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or:
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers:
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
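For illustration, the sketch below applies this repeated pattern in two ways: re.fullmatch to validate a whole string, and search to pull the first run of pairs out of a longer text. Note that every chunk must be digits followed by parenthesized digits, so trailing bare digits are rejected. The names REPEATED and is_repeated_pairs and the sample inputs are hypothetical.

```python
import re

# One or more "digits(digits)" pairs in a row.
REPEATED = re.compile(r"(?:\d+\(\d+\))+")

def is_repeated_pairs(text):
    """True if text is nothing but digits(digits) pairs."""
    return REPEATED.fullmatch(text) is not None

print(is_repeated_pairs("123(432)123(342)2348(34)"))  # True
print(is_repeated_pairs("123(432)99"))                # False: trailing digits have no (...)
print(is_repeated_pairs("(432)"))                     # False: no leading digits

# search() finds the first run of pairs inside a longer string.
print(REPEATED.search("x123(4)56(7)y").group())       # 123(4)56(7)
```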
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern.
Explanation: With a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*.
For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the number of steps needed to obtain a match and avoids much of the potential backtracking.
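To make the rewrite concrete, the sketch below compares the alternation form with the unrolled form on a handful of short inputs and checks that both give the same verdicts. It only demonstrates equivalence on these samples and does not measure the step counts. The names ALTERNATION, UNROLLED, SAMPLES and verdicts are assumptions made for this example, and the alternation uses + rather than * to mirror the (?=.) non-empty check.

```python
import re

# Alternation form: one or more of "digits" or "(digits)".
ALTERNATION = re.compile(r"^(?:\d+|\(\d+\))+$")
# Unrolled form from the answer: A*(BA*)* with A = \d and B = \(\d+\).
UNROLLED = re.compile(r"^(?=.)\d*(?:\(\d+\)\d*)*$")

SAMPLES = [
    "123(432)123(342)2348(34)", "123", "(1)", "(1)(2)",  # expected to match
    "", "12(", "()", "1(2", "a1",                        # expected to fail
]

def verdicts(pattern):
    """Return one boolean per sample: does the pattern match it?"""
    return [pattern.match(s) is not None for s in SAMPLES]

print(verdicts(UNROLLED))
print(verdicts(ALTERNATION) == verdicts(UNROLLED))  # True
```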
Note that you can improve it further if you emulate an atomic group (?>....) with the trick (?=(....))\1, which relies on the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
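A minimal sketch of this lookahead trick with the Python re module: the lookahead captures the longest run it can in group 1, and \1 then consumes exactly that text, so the engine cannot backtrack into it. Python 3.11 and later also accept (?>...) directly, but the sketch sticks to the emulation so that it runs on older versions as well. The names ATOMIC_EMULATED and matches_atomic are illustrative.

```python
import re

# (?=( ... ))\1 : capture inside a lookahead, then consume the capture as a fixed string.
ATOMIC_EMULATED = re.compile(r"^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$")

def matches_atomic(text):
    """True if the whole text is a sequence of digits and (digits) groups."""
    return ATOMIC_EMULATED.match(text) is not None

s = "123(432)123(342)2348(34)"
print(matches_atomic(s))                        # True
print(ATOMIC_EMULATED.match(s).group(1) == s)   # True: group 1 holds the whole string
print(matches_atomic("12("))                    # False
print(matches_atomic(""))                       # False: rejected by (?=.)
```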
Demo (compare the number of steps needed with the previous version, and check the debugger to see what happens).
Note: if you don't want two consecutive numbers enclosed in parentheses, you only need to change the quantifier of \d* inside the non-capturing group from * to +, and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
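As a rough check of the stricter variant, the sketch below uses the first of the two patterns and shows that back-to-back parenthesized numbers are rejected while a single trailing group is still allowed. The names NO_ADJACENT and has_no_adjacent_groups and the sample inputs are hypothetical.

```python
import re

# Inside the repeated group, each (digits) must be followed by at least one digit;
# a single optional (digits) is allowed at the very end.
NO_ADJACENT = re.compile(r"^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$")

def has_no_adjacent_groups(text):
    """True if text is valid and never has two (digits) groups in a row."""
    return NO_ADJACENT.match(text) is not None

print(has_no_adjacent_groups("123(432)123(342)2348(34)"))  # True
print(has_no_adjacent_groups("1(2)3"))                     # True
print(has_no_adjacent_groups("(1)(2)"))                    # False
print(has_no_adjacent_groups("1(2)(3)4"))                  # False
```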