I want to remove all the words that end with a dot '.' in a file. My file is around 15 MB and contains more than 400,000 words. I am using re.findall to find such words and replace them:
for w in re.findall(r'([a-zA-Z0-9]+\.)', test_dict):
    test_dict = test_dict.replace(w, ' ')
This is taking a very long time to execute. Is there a way to improve performance, or an alternative method to find and replace such words?
You can try using re.sub instead of looping over the result of re.findall.
import re

# Example text:
text = 'this is. a text with periods.'
re.sub(r'([a-zA-Z0-9]+\.)', ' ', text)
For this example, this returns the same result as your loop (each removed word leaves a single space behind):
'this   a text with  '
On a relatively small document (179KB, Romeo and Juliet), the re.findall loop takes about 0.369 seconds, and re.sub takes about 0.0091 seconds.
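If you want to check the difference on your own data, here is a minimal timing sketch. The function names and the synthetic sample text are assumptions made for illustration; replace `sample` with the contents of your file. The loop is slow because every str.replace call scans the whole string again, once per match, while re.sub makes a single pass.

```python
import re
import timeit

# Compile once so the pattern is not looked up again on every call.
DOTTED_WORD = re.compile(r'[a-zA-Z0-9]+\.')

def remove_with_loop(text):
    """Approach from the question: one full-string replace per match."""
    for w in DOTTED_WORD.findall(text):
        text = text.replace(w, ' ')
    return text

def remove_with_sub(text):
    """Single pass over the text with a compiled pattern."""
    return DOTTED_WORD.sub(' ', text)

# Synthetic sample text (an assumption; use your own file contents instead).
sample = 'this is. a text with periods. ' * 500

loop_time = timeit.timeit(lambda: remove_with_loop(sample), number=3)
sub_time = timeit.timeit(lambda: remove_with_sub(sample), number=3)
print(f'loop: {loop_time:.4f}s  re.sub: {sub_time:.4f}s')
```

The absolute numbers depend on your machine and input, so treat them only as a relative comparison.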
In Python, you can loop over a file line by line and over a line word by word.
So you might consider:
with open(your_file) as f_in, open(new_file, 'w') as f_out:
    for line in f_in:
        # keep only the words that do not end with a dot
        f_out.write(' '.join(w for w in line.split() if not w.endswith('.')) + '\n')
# then decide if you want to overwrite your_file with new_file
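Note that splitting on whitespace is not quite the same as the regular expression in the question: it drops any token ending in '.', whatever characters it contains, and it collapses runs of whitespace. If you want to keep the original pattern but still avoid loading the whole file at once, the two ideas can be combined. This is only a sketch; `strip_dotted_words` is a hypothetical helper name, and io.StringIO stands in for real files here.

```python
import io
import re

DOTTED_WORD = re.compile(r'[a-zA-Z0-9]+\.')

def strip_dotted_words(lines_in, out):
    """Write each input line to `out`, replacing dot-terminated words with a space."""
    for line in lines_in:
        out.write(DOTTED_WORD.sub(' ', line))

# In-memory demonstration; file objects returned by open() can be passed the same way.
src = io.StringIO('this is. a text\nwith periods. here\n')
dst = io.StringIO()
strip_dotted_words(src, dst)
print(repr(dst.getvalue()))
```

Be aware that this pattern also matches inside a token: in '3.14' it removes '3.' and leaves ' 14', so adjust it if that is not what you want.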