
Python regex re.findall taking too long to execute

I want to remove all the words that end with a dot '.' in a file. My file is around 15 MB and contains more than 400,000 words. I am using re.findall to find such words and replace them.

for w in re.findall(r'([a-zA-Z0-9]+\.)', test_dict):
    test_dict = test_dict.replace(w, ' ')

This takes a very long time to execute. Is there a way to improve performance, or an alternative method to find and replace such words?

You can use re.sub instead of looping over the result of re.findall.

import re

# Example text:
text = 'this is. a text with periods.'

# Replace every run of letters/digits followed by a dot with a single space
re.sub(r'([a-zA-Z0-9]+\.)', ' ', text)

For this example, it returns the same result as your loop:

'this   a text with  '

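On a relatively small document (Romeo and Juliet, 179 KB), the re.findall loop takes about 0.369 seconds, while re.sub takes about 0.0091 seconds, roughly 40 times faster. The loop is slow because each str.replace call rescans the entire string once per match, whereas re.sub makes a single pass.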
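
If you want to check the difference on your own machine, here is a minimal timing sketch. The sample text is synthetic and the function names (remove_with_loop, remove_with_sub) are made up for illustration, so the absolute numbers will not match the ones quoted above.

```python
import re
import timeit

# Precompiled version of the pattern from the question
PATTERN = re.compile(r'[a-zA-Z0-9]+\.')

def remove_with_loop(text):
    # Original approach: one full-string str.replace per match
    for w in re.findall(r'([a-zA-Z0-9]+\.)', text):
        text = text.replace(w, ' ')
    return text

def remove_with_sub(text):
    # Single pass over the text
    return PATTERN.sub(' ', text)

# Synthetic sample text (an assumption, not the asker's file)
sample = 'this is. a text with periods. ' * 500

loop_time = timeit.timeit(lambda: remove_with_loop(sample), number=1)
sub_time = timeit.timeit(lambda: remove_with_sub(sample), number=1)
print(f'loop: {loop_time:.4f}s, re.sub: {sub_time:.4f}s')
```

The two functions agree on this sample, but they are not guaranteed to agree on every input: str.replace also replaces a found word wherever it appears as a substring of a longer word.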

In Python, you can loop over a file line-by-line and a line word-by-word.

So you might consider:

with open(your_file) as f_in, open(new_file, 'w') as f_out:
    for line in f_in:
        # keep only the words that do not end with a dot
        f_out.write(' '.join(w for w in line.split() if not w.endswith('.')) + '\n')
# then decide if you want to overwrite your_file with new_file
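
To try the line-by-line idea without touching real files, here is a small sketch that runs the same filter over in-memory streams. The helper name drop_dot_words is hypothetical, and io.StringIO stands in for the opened files.

```python
import io

def drop_dot_words(f_in, f_out):
    # Read one line at a time so the whole file never sits in memory
    for line in f_in:
        kept = [w for w in line.split() if not w.endswith('.')]
        f_out.write(' '.join(kept) + '\n')

# In-memory stand-ins for your_file and new_file
src = io.StringIO('this is. a text\nwith periods. here\n')
dst = io.StringIO()
drop_dot_words(src, dst)
print(repr(dst.getvalue()))
```

Note that this behaves a little differently from the regex: it drops any whitespace-separated token ending in a dot (including ones with punctuation, such as '(end).'), and it collapses runs of whitespace into single spaces.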
