简体   繁体   中英

100% CPU usage with a regexp depending on input length

I'm trying to come up with a regexp in Python that has to match any character but avoiding three or more consecutive commas or semicolons. In other words, only up to two consecutive commas or semicolons are allowed.

So this is what I currently have:

^(,|;){,2}([^,;]+(,|;){,2})*$

And it seems to work as expected:

>>> r.match('')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, a')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, ,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, ,,a')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, ,,,')
>>> r.match('foo, ,,,;')
>>> r.match('foo, ,, ;;')
<_sre.SRE_Match object at 0x7f23af840750>

But as I start to increase the length of the input text, the regexp seems to need way more time to give a response.

>>> r.match('foo, bar, baz,, foo')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, bar, baz,, fooooo, baaaaar')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,,')
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,,,')
>>> r.match('foo, bar, baz,, fooooo, baaaaar, baaaaaaz,,,,')

And finally it gets completely stuck at this stage and the CPU usage goes up to 100%.

I'm not sure if the regexp could be optimized or there's something else involved, any help appreciated.

You're running into catastrophic backtracking .

The reason for this is that you have made the separators optional, and therefore the [^,;]+ part (which is itself in a repeating group) of your regex will try loads of permutations (of baaaaaaaz ) before finally having to admit failure when confronted with more than two commas.

RegexBuddy aborts the match attempt after 1.000.000 steps of the regex engine with your last test string. Python will keep trying.

Imagine the string baaz,,, :

Trying your regex, the regex engine has to check all these:

  1. baaz,,<failure>
  2. baa + z,,<failure>
  3. ba + az,,<failure>
  4. ba + a + z,,<failure>
  5. b + aaz,,<failure>
  6. b + aa + z,,<failure>
  7. b + a + az,,<failure>
  8. b + a + a + z,,<failure>

before declaring overall failure. See how this grows exponentially with each additional character?

Behavior like this can be avoided with possessive quantifiers or atomic groups, both of which are sadly not supported by Python's current regex engine. But you can do an inverse check easily:

if ",,," in mystring or ";;;" in mystring:
    fail()

without needing a regex at all. If ,;, and the likes could also occur and should be excluded, then use Andrew's solution.

I think the following should do what you want:

^(?!.*[,;]{3})

This will fail if the string contains three or more , or ; in a row. If you actually want it to match a character add a . at the end.

This utilizes negative lookahead , which will cause the entire match to fail if the regex .*[,;]{3} would match.

Try this regular expression:

^([^,;]|,($|[^,]|,[^,])|;($|[^;]|;[^;]))*$

It matches repetitively:

  • one single character that is neither , nor ; , or
  • a , that is either not followed by another , or a ,, that is not followed by another , , or
  • a ; that is either not followed by another ; or a ;; that is not followed by another ;

until the end is reached. It is very efficient as it fails early without doing much backtracking.

How about this idea match the ones that have the pattern you don't want ".+,,," In Python just keep those that do not match. Should be fast

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM