
Regex taking too long (large data set)

I have the following problem: the data comes from a database (Oracle), but I was hoping to solve it in Python with regex. However, I fear the process won't finish in a reasonable time, so I could use some suggestions. Pulling the data out of the database into Python lists, I have the following:

- keywords, a list of 5,000 strings, each of length <= 40
- search_phrases, a list of 1/3 million strings, each of length between 50 and 150
- found_phrases, a list of 30,000 strings, each of length between 20 and 50

I want to search through search_phrases for patterns of the form:

- pattern1 = number keyword
- pattern2 = number keyword1 anything number keyword2

collect these patterns in a list, then remove those which are already in found_phrases.
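Concretely, with the number prefix used in the code below, the two patterns might look like this (the sample phrase and the 'mg'/'ml' keywords are purely illustrative):

```python
import re

p1 = r'[0-9.]+[/\s-]*'   # a number, optionally followed by /, whitespace, or -
phrase = 'take 500 mg twice daily, dilute in 10 ml of water'

# pattern1: number followed by a keyword
print(re.findall(p1 + 'mg', phrase))
# ['500 mg']

# pattern2: number keyword1 anything number keyword2
print(re.findall(p1 + 'mg' + '.*?' + p1 + 'ml', phrase))
# ['500 mg twice daily, dilute in 10 ml']
```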

First I tried the following in python:

import re

found = []
p1 = r'[0-9.]+[/\s-]*'
pattern1 = re.compile(r'|'.join([p1 + word for word in keywords]))
for phrase in search_phrases:
    found.extend(re.findall(pattern1, phrase))
return set(found).difference(found_phrases)

This doesn't work, because compiling the combined pattern raises an OverflowError in the regular expression engine. So instead I did a double for loop:

for phrase in search_phrases:
    for word in keywords:
        found.extend(re.findall(p1 + word, phrase))

but this is taking way too long (i.e., it still hasn't finished).
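One way around the OverflowError (a sketch, not from the question) is to compile the alternation in chunks of a few hundred keywords, so each compiled pattern stays below the engine's size limit:

```python
import re

def compile_in_chunks(keywords, prefix, chunk_size=500):
    """Build one alternation pattern per chunk of keywords.

    Sorting longest-first keeps e.g. 'gram' from being shadowed
    by 'g' when both land in the same alternation.
    """
    keywords = sorted(keywords, key=len, reverse=True)
    patterns = []
    for i in range(0, len(keywords), chunk_size):
        alternation = '|'.join(re.escape(w) for w in keywords[i:i + chunk_size])
        patterns.append(re.compile(prefix + '(?:' + alternation + ')'))
    return patterns

p1 = r'[0-9.]+[/\s-]*'
patterns = compile_in_chunks(['g', 'gram', 'mg'], p1)

found = []
for phrase in ['dose: 500 mg or 0.5 gram']:   # stand-in for search_phrases
    for pat in patterns:
        found.extend(pat.findall(phrase))
print(found)   # ['500 mg', '0.5 gram']
```

One caveat: when related keywords end up in different chunks, the same number can match in more than one chunk, so the final set() deduplication from the question is still needed.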

If you have any suggestions, either for how to complete this faster in Python, or for staying in the database (the lists are simply the distinct column entries from two different tables) and doing the regex there, please let me know. Thanks.

Update 1:

Right now I am only searching for pattern1 (time constraints), and I switched the order of the for loops to:

for word in keywords:
    for phrase in search_phrases:
        found.extend(re.findall(p1+word, phrase))

With this order, it runs on a sample search_phrases list (30,000 elements) in about 90 seconds.
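Part of that speedup is likely pattern reuse: the re module caches compiled patterns, but the cache is bounded (512 entries in current CPython), so with 5,000 keywords the original loop order keeps evicting and recompiling. Precompiling one pattern per keyword makes the reuse explicit (a sketch with made-up sample data):

```python
import re

keywords = ['mg', 'gram']          # stand-ins for the real 5,000 keywords
search_phrases = ['dose: 500 mg', 'add 0.5 gram slowly']

p1 = r'[0-9.]+[/\s-]*'
# re.escape is defensive: keywords like 'Milli-gram' then match literally
compiled = [re.compile(p1 + re.escape(word)) for word in keywords]

found = []
for pat in compiled:               # compile once, reuse across all phrases
    for phrase in search_phrases:
        found.extend(pat.findall(phrase))
print(found)   # ['500 mg', '0.5 gram']
```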

If I run grep -f keywords search_phrases, the resulting file is only about 5% shorter (most of the search_phrases will match).

Sample keywords: 'g', 'gr', 'G', 'gram', 'grams', 'mg', 'milli gram', 'Milli-gram', ... (plus all the variations you can think of for measuring mass). Sample search_phrases:

You can use htql.RegEx from http://htql.net. It can handle large lists well. Here is the example from its website:

import htql; 
address = '88-21 64th st , Rego Park , New York 11374'
states=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 
    'Delaware', 'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 
    'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 
    'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 
    'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 
    'Oregon', 'PALAU', 'Pennsylvania', 'PUERTO RICO', 'Rhode Island', 'South Carolina', 'South Dakota', 
    'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 
    'Wyoming']; 

a=htql.RegEx(); 
a.setNameSet('states', states);

state_zip1=a.reSearchStr(address, "&[s:states][,\s]+\d{5}", case=False)[0]; 
# state_zip1 = 'New York 11374'
