
How can I speed up an email-finding regular expression when searching through a massive string?

I have a massive string. It looks something like this:

hej34g934gj93gh398gie foo@bar.com e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug

Except that it's much longer (1,000,000+ characters).

My goal is to find all the email addresses in this string.

I've tried a number of solutions, including this one:

#matches foo@bar.com and bar@foo.com
re.findall(r'[\w\.-]{1,100}@[\w\.-]{1,100}', line)

Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.

I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:

#matches @bar.com and @foo.com
re.findall(r'@[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)

It executes in just a few milliseconds.

I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the @xx part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.

Don't run a regex over the whole string. Regexes are slow; avoiding them where possible is your best bet for better overall performance.

My first approach would look like this:

  • Split the string on spaces.
  • Filter the result down to the parts that contain @.
  • Create a pre-compiled regex.
  • Run that regex only on the remaining parts, to remove false positives.
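A minimal sketch of this approach (the validation pattern is carried over from the question and is only an assumption; tighten it to your needs):

```python
import re

# Pre-compile once so the compilation cost is paid a single time.
EMAIL_RE = re.compile(r'^[\w.-]{1,100}@[\w.-]{1,100}$')

def find_emails(text):
    # Split on whitespace, keep only tokens containing '@',
    # and use the regex solely to weed out false positives.
    return [tok for tok in text.split()
            if '@' in tok and EMAIL_RE.match(tok)]

line = ('hej34g934gj93gh398gie foo@bar.com '
        'e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug')
print(find_emails(line))  # -> ['foo@bar.com', 'bar@foo.com']
```

Because the regex is anchored with `^`/`$` and only ever sees short tokens, there is no long text for the engine to backtrack over.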

Another idea:

  • In a loop:
  • use .index("@") to find the position of the next candidate (str.find does the same but returns -1 instead of raising when there is no match left);
  • extend, e.g., 100 characters to the left and 50 to the right to cover the name and the domain;
  • adapt the range depending on the last email address you found, so the candidates don't overlap;
  • check the range with a regex; if it matches, yield the match.
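The loop above might be sketched like this (the window sizes and the validation pattern are illustrative assumptions, not tuned values):

```python
import re

# Validation regex, applied only to a small window around each '@'.
CANDIDATE_RE = re.compile(r'[\w.-]{1,100}@[\w.-]{1,100}')

def iter_emails(text, left=100, right=50):
    pos = 0
    end_of_last = 0  # absolute index just past the previous match
    while True:
        # str.find returns -1 instead of raising like str.index.
        pos = text.find('@', pos)
        if pos == -1:
            return
        # Extend left/right, but never back into the previous match.
        start = max(pos - left, end_of_last)
        stop = min(pos + 1 + right, len(text))
        m = CANDIDATE_RE.search(text[start:stop])
        if m:
            yield m.group()
            end_of_last = start + m.end()
        pos += 1

line = ('hej34g934gj93gh398gie foo@bar.com '
        'e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug')
print(list(iter_emails(line)))  # -> ['foo@bar.com', 'bar@foo.com']
```

The regex still runs, but only on windows of at most `left + 1 + right` characters, so its cost no longer scales with the full string length.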

You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:

The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes.

However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}@[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}@[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.

Python test:

import re, regex, timeit
text = 'your_long_string'
re_pattern = re.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
regex_pattern = regex.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
