I am new to Python, NumPy and regular expressions. I am trying to extract patterns from a pandas text column, row by row. My requirement covers many possible cases, so I wrote the different regular expressions below. To iterate and search for each pattern I am using NumPy's np.where,
but I am running into a performance issue. Is there any way to improve the performance, or any alternative that produces the output below?
x_train['Description'] is my pandas column.
There are 54672 rows in my dataset.
Code:
import re
import time
import numpy as np
pattern1 = re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)',re.I)
pattern2 = re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)',re.I)
pattern3 = re.compile(r'\badult[s]?.[\w\s]\d+',re.I)
pattern4 = re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?',re.I)
pattern5 = re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?',re.I)
pattern6 = re.compile(r'\bage.*?\s\d+[\s]*\+',re.I)
pattern7 = re.compile(r'\bbetween[\s]*age[s]?[\s]*\d+.*(?:month[s]?|year[s]?)',re.I)
pattern8 = re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)',re.I)
np_time = time.time()
x_train['pattern'] = np.where(x_train['Description'].str.contains(pattern1), x_train['Description'].str.findall(pattern1),
    np.where(x_train['Description'].str.contains(pattern2), x_train['Description'].str.findall(pattern2),
    np.where(x_train['Description'].str.contains(pattern3), x_train['Description'].str.findall(pattern3),
    np.where(x_train['Description'].str.contains(pattern4), x_train['Description'].str.findall(pattern4),
    np.where(x_train['Description'].str.contains(pattern5), x_train['Description'].str.findall(pattern5),
    np.where(x_train['Description'].str.contains(pattern6), x_train['Description'].str.findall(pattern6),
    np.where(x_train['Description'].str.contains(pattern7), x_train['Description'].str.findall(pattern7),
    np.where(x_train['Description'].str.contains(pattern8), x_train['Description'].str.findall(pattern8),
    'NO PATTERN')
)))))))
print("pattern extraction ran in = ")
print("--- %s seconds ---" % (time.time() - np_time))
pattern extraction ran in =
--- 99.5106501579 seconds ---
Sample input and output of the above code:
Row 0
Description: **AGE RANGE: 6 YEARS** AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS MULTIPLE LIGHT EFFECTS FADE IN AND OUT
pattern: AGE RANGE: 6 YEARS

Row 1
Description: DIMENSIONS OVERALL HEIGHT - TOP TO BOTTOM: 34.5'' OVERALL WIDTH - SIDE TO SIDE: 20'' OVERALL DEPTH - FRONT TO BACK: 15'' COUNTER TOP HEIGHT - TOP TO BOTTOM: 23'' OVERALL PRODUCT WEIGHT: 38 LBS " **"AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6 YEARS/7 TO 8 YEARS**.
pattern: AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6 YEARS/7 TO 8 YEARS.

Row 2
Description: THE FLAME-RETARDANT FOAM ALSO CONTAINS ANTIMICROBIAL PROTECTION, SO IT WON'T GROW MOLD OR BACTERIA IF IT GETS WET. THE BRIGHTLY-COLORED VINYL EXTERIOR IS EASY TO WIPE CLEAN. FOAMMAN IS DESIGNED FOR KIDS **AGED 1-5 YEARS**
pattern: AGED 1-5 YEARS
There are a couple of things you can try:
First, you need to identify the slower regular expressions. You can do this, for example, with https://regex101.com/ by observing the 'steps' value.
I inspected the regexes, and patterns 5 and 8 are the slowest ones:
27800 steps = [a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?
4404 steps = \b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)
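Step counts on regex101 depend on the test string used there, so it can help to confirm the ranking locally. The following is a minimal sketch that times `search` for a few of the question's patterns with Python's `re` module. The `sample` list and the `time_pattern` helper are hypothetical names introduced for illustration; in practice you would time against a slice of the real column.

```python
import re
import time

# Patterns 3, 5 and 8 from the question (pattern 3 is included as a cheap baseline).
patterns = {
    "pattern3": re.compile(r'\badult[s]?.[\w\s]\d+', re.I),
    "pattern5": re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I),
    "pattern8": re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
}

# Hypothetical sample rows, shortened from the question's examples and repeated.
sample = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS",
    "FOAMMAN IS DESIGNED FOR KIDS AGED 1-5 YEARS",
] * 200

def time_pattern(pattern, texts):
    """Return the seconds spent running pattern.search over all texts."""
    start = time.perf_counter()
    for text in texts:
        pattern.search(text)
    return time.perf_counter() - start

timings = {name: time_pattern(p, sample) for name, p in patterns.items()}

# Print the slowest pattern first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %.4f s" % (name, seconds))
```

The absolute numbers will vary by machine and data; only the relative ordering is of interest.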
You might consider optimizing those two regexes.
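For example, you could rewrite this: \b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)
into this: \b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))
which uses about 50% fewer steps. Note that this version keeps the trailing \b only after old[a-z]*, so up and above may also match at the start of a longer word; move the \b outside the inner group if that matters.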
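Before swapping one regex for another, it is worth checking that both return the same matches on representative text. This is a small sketch of such a check using only `re`; the sample strings are made up for illustration and are not from the dataset.

```python
import re

original = re.compile(
    r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I)
factored = re.compile(
    r'\b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))', re.I)

# Hypothetical sample strings.
samples = [
    "AGE RANGE: 6 YEARS AND UP 10' LONG",
    "FOR KIDS 3 AND ABOVE",
    "SUITABLE FOR 12 MONTHS AND OLDER",
    "NO AGE INFORMATION HERE",
]

# Both patterns agree on these inputs.
results = [(original.findall(s), factored.findall(s)) for s in samples]
for s, (a, b) in zip(samples, results):
    print(s, "->", a, b)

# Caveat: the factored form has no \b after "up" or "above",
# so it also matches when "UP" is only the start of a longer word.
edge = "3 AND UPPER"
edge_original = original.findall(edge)   # no match
edge_factored = factored.findall(edge)   # matches "3 AND UP"
```

If the edge case matters for your data, placing the `\b` after the inner group instead, as in `\band\s(?:up|above|old[a-z]*)\b`, should restore the original strictness.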
For the other regex, there are a couple of options. You may rewrite it as:
[A-Z][A-LN-XZ\s]+(?:(?:Y(?!EARS?)|M(?!ONTHS?))[A-LN-XZ\s]+)*(?:MONTHS?|YEARS?)[\w\s]+AGE[S]?
which is a bit faster, though not by much (27800 vs. 23800 steps).
However, what really seems to speed it up is making it case sensitive.
Run case-sensitively, the original regex executes just 3700 steps, and the optimized one 1470.
So you could just uppercase (or lowercase) your whole string and run it against case-sensitive regexes. You may not even need to transform your string, since your sample appears to be uppercase already.
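As a rough sketch of the case-sensitive approach, the snippet below rewrites pattern 5 from the question in upper case, compiles it without `re.I`, and upper-cases each string once before matching. The names `pattern5_upper` and `find_upper` and the sample descriptions are assumptions made for illustration.

```python
import re

# Pattern 5 from the question, written in upper case and compiled WITHOUT re.I.
pattern5_upper = re.compile(r'[A-Z][A-Z\s]+(?:MONTHS?|YEARS?)[\w\s]+AGE[S]?')

def find_upper(text):
    """Upper-case the text once, then run the case-sensitive pattern."""
    return pattern5_upper.findall(text.upper())

# Hypothetical descriptions, not taken from the dataset.
descriptions = [
    "Suitable for children three years of age",
    "String of beautiful lights",
]
matches = [find_upper(d) for d in descriptions]
print(matches)
```

One consequence of this design is that the extracted text comes back in upper case rather than in its original casing. With pandas, the same idea would presumably be applied by upper-casing the column once, for example with `x_train['Description'].str.upper()`, before running the case-sensitive patterns.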
Another thing to look at is the order in which the regexes are tested. If some regexes are more likely to match than others, they should be tested first.
If you can't know such probabilities, or you think they are roughly equal, consider putting the simpler regexes first, since testing a complex regex that rarely matches is a waste of time.
Finally, when you have alternatives such as (a|b|c), consider putting the most probable one first, for the same reason.
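Ordering only pays off if evaluation stops at the first hit. In the nested np.where from the question, every `str.contains` and `str.findall` call is evaluated on the whole column before np.where chooses between them, so the order has no effect there. The following is a minimal sketch of a different technique, a first-match-wins loop, which runs each regex at most once per row and stops early. The function name `first_matching`, the subset of patterns, and their order are assumptions for illustration.

```python
import re

# Patterns 1, 6 and 8 from the question, in the order they should be tried.
# The order shown here is an assumption; put the likeliest or cheapest first.
ORDERED_PATTERNS = [
    re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)', re.I),
    re.compile(r'\bage.*?\s\d+[\s]*\+', re.I),
    re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I),
]

def first_matching(text, patterns=ORDERED_PATTERNS, default='NO PATTERN'):
    """Return findall() of the first pattern that matches, else the default."""
    for pattern in patterns:
        found = pattern.findall(text)
        if found:
            return found  # stop here; later patterns are never run
    return default

r0 = first_matching("AGE RANGE: 6 YEARS AND UP 10' LONG")
r1 = first_matching("FOR KIDS 3 AND ABOVE")
r2 = first_matching("STRING OF BEAUTIFUL LIGHTS")
print(r0, r1, r2)
```

On the DataFrame this could be applied with something like `x_train['pattern'] = x_train['Description'].apply(first_matching)`, assuming every value in the column is a string.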