python regex re.findall taking too long to execute

Question

I want to remove every word that ends with a dot ('.') from a file. The file is about 15 MB and contains more than 400,000 words. I am using re.findall to find such words and then replacing each one:

for w in re.findall(r'([a-zA-Z0-9]+\.)', test_dict):
    test_dict = test_dict.replace(w, ' ')

This takes a very long time to execute. Is there a way to improve the performance, or an alternative method to find and replace such words?

Answer 1

You can use re.sub instead of looping over the result of re.findall.

import re

# Example text:
text = 'this is. a text with periods.'

# Replace every alphanumeric run followed by a dot in a single pass
re.sub(r'([a-zA-Z0-9]+\.)', ' ', text)

For this example, it returns the same result as your loop:

'this   a text with  '
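One caveat worth checking on your own data: str.replace matches substrings anywhere in the text, so the loop and re.sub can diverge when one matched word is the tail of another. The sketch below illustrates this; the helper names remove_with_loop and remove_with_sub and the 'tricky' input are made up for the illustration.

```python
import re

# Same pattern as in the question, without the redundant capture group.
PATTERN = re.compile(r'[a-zA-Z0-9]+\.')

def remove_with_loop(text):
    # The question's approach: one str.replace call per match found.
    for w in PATTERN.findall(text):
        text = text.replace(w, ' ')
    return text

def remove_with_sub(text):
    # The approach from this answer: replace all matches in one pass.
    return PATTERN.sub(' ', text)

example = 'this is. a text with periods.'
print(remove_with_loop(example) == remove_with_sub(example))  # True

# Hypothetical input where 'is.' is also the tail of 'this.'
tricky = 'is. this.'
print(repr(remove_with_loop(tricky)))  # '  th '
print(repr(remove_with_sub(tricky)))   # '   '
```

In the second case the loop removes the 'is.' inside 'this.' and leaves 'th' behind, while re.sub removes each whole match.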

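On a relatively small document (179 KB, Romeo and Juliet), the re.findall loop takes about 0.369 seconds, and re.sub takes about 0.0091 seconds. The loop is slower because every str.replace call rescans the entire string and builds a new copy of it, once per match.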
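If you want to measure the difference yourself, a rough timing sketch with timeit could look like the following. The sample text is a synthetic stand-in (an assumption, not the Romeo and Juliet file), and the function names are made up for the illustration, so the absolute numbers will differ from the ones above.

```python
import re
import timeit

pattern = re.compile(r'[a-zA-Z0-9]+\.')

# Synthetic stand-in for a real document; swap in your own file contents.
sample = 'alpha beta. gamma delta. ' * 500

def loop_version():
    # One full-string str.replace per match returned by findall.
    text = sample
    for w in pattern.findall(text):
        text = text.replace(w, ' ')
    return text

def sub_version():
    # A single substitution pass over the text.
    return pattern.sub(' ', sample)

loop_seconds = timeit.timeit(loop_version, number=3)
sub_seconds = timeit.timeit(sub_version, number=3)
print(f'findall loop: {loop_seconds:.4f} s, re.sub: {sub_seconds:.4f} s')
```

The gap should widen as the text grows, since the loop does work proportional to the number of matches times the length of the text.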

Answer 2

In Python, you can loop over a file line by line and over a line word by word.

So you might consider:

with open(your_file) as f_in, open(new_file, 'w') as f_out:
    for line in f_in:
        # keep only the words that do not end with a dot
        f_out.write(' '.join(w for w in line.split() if not w.endswith('.')) + '\n')
# then decide if you want to overwrite your_file with new_file
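Note that split() and join() collapse runs of whitespace, and endswith('.') drops any whitespace-delimited token ending in a dot, which is not quite what the regex in the question does. If you want the regex behavior while still reading line by line, a sketch along these lines might work. The names strip_dotted_words_stream and clean_file are hypothetical, and the pattern is the one from the question; because it cannot match across a newline, applying it per line should give the same result as applying it to the whole text.

```python
import re

# Pattern from the question: an alphanumeric run followed by a dot.
DOTTED_WORD = re.compile(r'[a-zA-Z0-9]+\.')

def strip_dotted_words_stream(lines):
    """Yield each line with every match of DOTTED_WORD replaced by a space."""
    for line in lines:
        yield DOTTED_WORD.sub(' ', line)

def clean_file(src_path, dst_path):
    # Stream from src_path to dst_path without loading the whole file.
    with open(src_path) as f_in, open(dst_path, 'w') as f_out:
        f_out.writelines(strip_dotted_words_stream(f_in))

# Small in-memory demonstration; any iterable of lines works.
demo = ['this is. a\n', 'end.\n']
print(list(strip_dotted_words_stream(demo)))  # ['this   a\n', ' \n']
```

Calling clean_file(your_file, new_file) would then write the cleaned copy, after which you can decide whether to overwrite the original.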

solution1
3 ACCEPTED 2018-06-28 14:34:24

solution2
0 2018-06-28 14:58:17
