
How can I speed up an email-finding regular expression when searching through a massive string?

I have a massive string. It looks something like this:

hej34g934gj93gh398gie foo@bar.com e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug

Except that it's much longer (1,000,000+ characters).

My goal is to find all the email addresses in this string.

I've tried a number of solutions, including this one:

#matches foo@bar.com and bar@foo.com
re.findall(r'[\w\.-]{1,100}@[\w\.-]{1,100}', line)

Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.

I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:

#matches @bar.com and @foo.com
re.findall(r'@[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)

It executes in just a few milliseconds.

I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the @xx part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.

Don't run a regex over the whole string. Regexes are slow; avoiding them where possible is your best bet for better overall performance.

My first approach would look like this:

  • Split the string on spaces.
  • Filter the result down to the parts that contain @.
  • Create a pre-compiled regex.
  • Run that regex only on the remaining parts, to remove false positives.
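A minimal sketch of this approach (the validation pattern is carried over from the question and is only an assumption; tighten it to your needs):

```python
import re

# Pre-compile once so the compilation cost is paid a single time.
EMAIL_RE = re.compile(r'^[\w.-]{1,100}@[\w.-]{1,100}$')

def find_emails(text):
    # Split on whitespace, keep only tokens containing '@',
    # and use the regex solely to weed out false positives.
    return [tok for tok in text.split()
            if '@' in tok and EMAIL_RE.match(tok)]

line = ('hej34g934gj93gh398gie foo@bar.com '
        'e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug')
print(find_emails(line))  # -> ['foo@bar.com', 'bar@foo.com']
```

Because the regex is anchored with `^`/`$` and only ever sees short tokens, there is no long text for the engine to backtrack over.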

Another idea:

  • In a loop:
  • use .index("@") to find the position of the next candidate (str.find does the same but returns -1 instead of raising when there is no match left);
  • extend, e.g., 100 characters to the left and 50 to the right to cover the name and the domain;
  • adapt the range depending on the last email address you found, so the candidates don't overlap;
  • check the range with a regex; if it matches, yield the match.
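The loop above might be sketched like this (the window sizes and the validation pattern are illustrative assumptions, not tuned values):

```python
import re

# Validation regex, applied only to a small window around each '@'.
CANDIDATE_RE = re.compile(r'[\w.-]{1,100}@[\w.-]{1,100}')

def iter_emails(text, left=100, right=50):
    pos = 0
    end_of_last = 0  # absolute index just past the previous match
    while True:
        # str.find returns -1 instead of raising like str.index.
        pos = text.find('@', pos)
        if pos == -1:
            return
        # Extend left/right, but never back into the previous match.
        start = max(pos - left, end_of_last)
        stop = min(pos + 1 + right, len(text))
        m = CANDIDATE_RE.search(text[start:stop])
        if m:
            yield m.group()
            end_of_last = start + m.end()
        pos += 1

line = ('hej34g934gj93gh398gie foo@bar.com '
        'e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug')
print(list(iter_emails(line)))  # -> ['foo@bar.com', 'bar@foo.com']
```

The regex still runs, but only on windows of at most `left + 1 + right` characters, so its cost no longer scales with the full string length.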

You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:

The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes.

However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}@[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}@[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.

Python test:

import re, regex, timeit
text = 'your_long_string'
re_pattern = re.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
regex_pattern = regex.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
