简体   繁体   中英

Why my Python regular expression pattern run so slowly?

Please see my regular expression pattern code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import re

print 'Start'
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print m # None is expected.
print 'Done'

It takes 49 seconds to finish, any problem with the pattern?

See Runaway Regular Expressions: Catastrophic Backtracking .

In brief, if there are extremely many combinations a substring can be split into the parts of the regex, the regex matcher may end up trying them all.

Constructs like (x+)+ and x+x+ practically guarantee this behaviour.

To detect and fix the problematic constructs, the following concept can be used:

  • At conceptual level, the presence of a problematic construct means that your regex is ambiguous - ie if you disregard greedy/lazy behaviour, there's no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid/fix the problems, you need to see and eliminate all ambiguities.

    • One way to do this is to

      • always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
      • define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)

Just repost the answer and solution in comments from nhahtdh and Marc B:

([A-Za-z\\-\\s\\:\\.]+)+ --> [A-Za-z\\-\\s\\:\\.]+

Thanks so much to nhahtdh and Marc B!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM