
np.where: how to improve performance with regular expressions?

I am new to Python, NumPy and regular expressions. I am trying to extract patterns from a pandas text column, row by row. There are many possible cases in my requirement, so I wrote the different regular expressions below. To search for each pattern I am using np.where, but I am running into a performance issue. Is there any way to improve the performance, or an alternative way to achieve the output below?

x_train['Description'] is my pandas column.

There are 54672 rows in my dataset.


Code:

import re
import time
import numpy as np

pattern1 = re.compile(r'\bAGE[a-z]?\b[\s\w]*\W+\d+.*(?:year[s]|month[s]?)', re.I)
pattern2 = re.compile(r'\bfor\b[\s]*age[s]?\W+\d+\W+(?:month[s]?|year[s]?)', re.I)
pattern3 = re.compile(r'\badult[s]?.[\w\s]\d+', re.I)
pattern4 = re.compile(r'\b\d+\W+(?:month[s]?|year[s]?)\W+of\W+age[a-z]?', re.I)
pattern5 = re.compile(r'[a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?', re.I)
pattern6 = re.compile(r'\bage.*?\s\d+[\s]*\+', re.I)
pattern7 = re.compile(r'\bbetween[\s]*age[s]?[\s]*\d+.*(?:month[s]?|year[s]?)', re.I)
pattern8 = re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I)

np_time = time.time()

# For each row, keep the matches of the first pattern that occurs in it,
# otherwise fall back to 'NO PATTERN'.
x_train['pattern'] = np.where(x_train['Description'].str.contains(pattern1), x_train['Description'].str.findall(pattern1),
                     np.where(x_train['Description'].str.contains(pattern2), x_train['Description'].str.findall(pattern2),
                     np.where(x_train['Description'].str.contains(pattern3), x_train['Description'].str.findall(pattern3),
                     np.where(x_train['Description'].str.contains(pattern4), x_train['Description'].str.findall(pattern4),
                     np.where(x_train['Description'].str.contains(pattern5), x_train['Description'].str.findall(pattern5),
                     np.where(x_train['Description'].str.contains(pattern6), x_train['Description'].str.findall(pattern6),
                     np.where(x_train['Description'].str.contains(pattern7), x_train['Description'].str.findall(pattern7),
                     np.where(x_train['Description'].str.contains(pattern8), x_train['Description'].str.findall(pattern8),
                              'NO PATTERN'))))))))

print("pattern extraction ran in = ")
print("--- %s seconds ---" % (time.time() - np_time))



pattern extraction ran in = 
--- 99.5106501579 seconds ---

Sample input and output of the above code:

Row 0
  Description: **AGE RANGE: 6 YEARS** AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS MULTIPLE LIGHT EFFECTS FADE IN AND OUT
  pattern:     AGE RANGE: 6 YEARS

Row 1
  Description: DIMENSIONS   OVERALL HEIGHT - TOP TO BOTTOM: 34.5''  OVERALL WIDTH - SIDE TO SIDE: 20''  OVERALL DEPTH - FRONT TO BACK: 15''  COUNTER TOP HEIGHT - TOP TO BOTTOM: 23''  OVERALL PRODUCT WEIGHT: 38 LBS "**AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6 YEARS/7 TO 8 YEARS**."
  pattern:     AGE GROUP: -2 YEARS/3 TO 4 YEARS/5 TO 6 YEARS/7 TO 8 YEARS.

Row 2
  Description: THE FLAME-RETARDANT FOAM ALSO CONTAINS ANTIMICROBIAL PROTECTION, SO IT WON'T GROW MOLD OR BACTERIA IF IT GETS WET. THE BRIGHTLY-COLORED VINYL EXTERIOR IS EASY TO WIPE CLEAN. FOAMMAN IS DESIGNED FOR KIDS **AGED 1-5 YEARS**
  pattern:     AGED 1-5 YEARS
There are a couple of things you can try:

First, you need to identify the slower regular expressions. You can do this, for example, with https://regex101.com/ by observing the 'steps' value.

I inspected the regexes, and numbers 5 and 8 are the slowest ones:

27800 steps: [a-z][a-z\s]+(?:month[s]?|year[s]?)[\w\s]+age[s]?
 4404 steps: \b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)
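
If you would rather measure this on your actual data than on regex101, a quick per-pattern timing sketch (reusing the pattern1 … pattern8 names and the x_train frame from your question) could look like this:

import time

patterns = [pattern1, pattern2, pattern3, pattern4,
            pattern5, pattern6, pattern7, pattern8]

# Time each compiled pattern on its own so the slow ones stand out.
for i, pat in enumerate(patterns, start=1):
    start = time.time()
    x_train['Description'].str.contains(pat)
    print("pattern%d: %.2f seconds" % (i, time.time() - start))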

You might consider optimizing those two regexes.

For example, you could rewrite this

\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)

into this

\b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))

which uses about 50% fewer steps.
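
As a quick sanity check (my own snippet, not part of your code), both forms still pull the same match out of the first sample description:

import re

# Original alternation: three full "\band ..." branches
p_orig = re.compile(r'\b\d+[\w+\s]*?(?:\band\sup\b|\band\sabove\b|\band\sold[a-z]*\b)', re.I)
# Factored version: "\band\s" is matched only once
p_fast = re.compile(r'\b\d+[\w+\s]*?(?:\band\s(?:up|above|old[a-z]*\b))', re.I)

text = "AGE RANGE: 6 YEARS AND UP 10' LONG STRING OF BEAUTIFUL LIGHTS"
print(p_orig.findall(text))   # ['6 YEARS AND UP']
print(p_fast.findall(text))   # ['6 YEARS AND UP']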

For the other regexp, there are a couple of options. You may rewrite it as:

[A-Z][A-LN-XZ\s]+(?:(?:Y(?!EARS?)|M(?!ONTHS?))[A-LN-XZ\s]+)*(?:MONTHS?|YEARS?)[\w\s]+AGE[S]?

This is a bit faster, though not by much (27800 vs 23800 steps).

However, what really seems to speed things up is making the regex case-sensitive.

Run case-sensitively, the original regex executes just 3700 steps, and the optimized one 1470.

So you could just uppercase (or lowercase) your whole string and run it against case-sensitive versions of your regexes. You may not even need to transform the string, since in your sample it already appears to be uppercase anyway.
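
A minimal sketch of that idea, using pattern5 as an example (the uppercase, case-sensitive rewrite below is mine, so double-check it against your data):

# Uppercase the column once so a case-sensitive pattern can be used.
desc_upper = x_train['Description'].str.upper()

# Same shape as pattern5, but spelled out in uppercase and compiled
# WITHOUT re.I, so the cheaper case-sensitive matching is used.
pattern5_cs = re.compile(r'[A-Z][A-Z\s]+(?:MONTH[S]?|YEAR[S]?)[\w\s]+AGE[S]?')

print(desc_upper.str.findall(pattern5_cs).head())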

Another thing to look at is the order in which the regexes are tested. If some regexes are more likely to match than others, they should be tested first.

If you don't know those probabilities and think they are roughly the same, consider putting the simpler regexes first; testing a complex regex that is hard to match is a waste of time.
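
One way to make that ordering actually pay off (a sketch of mine, not a drop-in replacement for your code) is to stop at the first pattern that matches, instead of running every .str.contains/.str.findall pass the way the nested np.where does:

patterns = [pattern1, pattern2, pattern3, pattern4,
            pattern5, pattern6, pattern7, pattern8]

def first_match(text):
    # Try the patterns in (probability) order and stop at the first hit,
    # so the expensive patterns only run on rows nothing else matched.
    for pat in patterns:
        found = pat.findall(text)
        if found:
            return found
    return 'NO PATTERN'

x_train['pattern'] = x_train['Description'].apply(first_match)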

Finally, when you have alternatives such as (a|b|c), consider putting the most probable one at the beginning, for the same reason as before.
