I have a massive string. It looks something like this:
hej34g934gj93gh398gie foo@bar.com e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug
Except that it's much longer (1,000,000+ characters).
My goal is to find all the email addresses in this string.
I've tried a number of solutions, including this one:
#matches foo@bar.com and bar@foo.com
re.findall(r'[\w\.-]{1,100}@[\w\.-]{1,100}', line)
Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.
I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:
#matches @bar.com and @foo.com
re.findall(r'@[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)
It executes in just a few milliseconds.
I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the @xx part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.
Don't use regex on the whole string. Regex are slow. Avoiding them is your best bet to better overall performance.
My first approach would look like this:
@
.Another idea:
.index("@")
to find the position of the next candidateyield
the match You can use PyPi regex
module by Matthew Barnett, that is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions at his post:
The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes.
However, there is yet another trick you may implement in your regex: Python re
(and regex
, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}@[\w.-]{1,100}'
or r'\b\w[\w.-]{0,99}@[\w.-]{1,100}'
should also work much better than the original pattern without a word boundary.
Python test:
import re, regex, timeit
text='your_long_sting'
re_pattern=re.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
regex_pattern=regex.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.