I have a large txt file (around 20GB) and I want to replace all instances of a list of words in this file. I am struggling to find a way to optimize this code, so processing the file takes a very long time.
What could I improve?
corpus_input = open(corpus_in, "rt")
corpus_out = open(corpus_out, "wt")
for line in corpus_input:
    temp_str = line
    for word in dict_keys:
        if word in line:
            new_word = word + "_lauren_ipsum"
            temp_str = re.sub(fr'\b{word}\b', new_word, temp_str)
        else:
            continue
    corpus_out.writelines(temp_str)
corpus_input.close()
corpus_out.close()
The most important thing for optimisation is to understand what exactly is performing poorly. Then you can see what can be optimized.
If, for example, reading and writing take 99% of the time, it's not really worth optimizing the processing of your data: even if you could speed up the processing by a factor of 10, you would only gain 0.9% overall.
I suggest measuring and comparing a few versions and posting the performance differences. That might lead to further suggestions for optimisation.
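For example, a minimal timing sketch (my addition; it assumes each variant below is wrapped in a function taking the input and output paths):

import time

def benchmark(variant, corpus_in, corpus_out):
    # 'variant' is assumed to be a function wrapping one of the versions below
    start = time.perf_counter()
    variant(corpus_in, corpus_out)
    print(f"{variant.__name__}: {time.perf_counter() - start:.1f} s")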
In all of the examples below I replaced writelines with write, as writelines is probably decomposing your line character by character before writing it. In any case, you want to use write here. This alone should already give you a speedup of about 5.
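To illustrate the difference: writelines expects an iterable of strings, and a single string is itself an iterable of one-character strings, so the line gets handed over character by character. A small sketch using an in-memory buffer:

import io

out = io.StringIO()
line = "some text\n"

# writelines iterates its argument; a plain string yields single characters,
# so the line is effectively written one character at a time
out.writelines(line)

# write takes the whole string in one call
out.write(line)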
1.) Just reading and writing
with open(corpus_in, "rt") as corpus_input, open(corpus_out, "wt") as corpus_out:
    for line in corpus_input:
        corpus_out.write(line)
2.) Just reading and writing with a bigger buffer
import io

BUF_SIZE = 50 * io.DEFAULT_BUFFER_SIZE  # try other buffer sizes if you see an impact

with open(corpus_in, "rt", BUF_SIZE) as corpus_input, open(corpus_out, "wt", BUF_SIZE) as corpus_out:
    for line in corpus_input:
        corpus_out.write(line)
For me this increases performance by a few percent.
3.) Move the regexp compilation and the generation of the replacement strings out of the loop.
rules = []
for word in dict_keys:
    rules.append((re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

for line in corpus_input:
    for regexp, new_word in rules:
        line = regexp.sub(new_word, line)
    corpus_out.write(line)
On my machine, with my frequency of lines containing matching words, this solution is in fact slower than the one with the check if word in line.
So perhaps try:
3.a) Move the regexp compilation and replacement generation out of the loop, but keep the if word in line check.
rules = []
for word in dict_keys:
    rules.append((word, re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

for line in corpus_input:
    for word, regexp, new_word in rules:
        if word in line:
            line = regexp.sub(new_word, line)
    corpus_out.write(line)
3.b) If all replacement strings are longer than the original words, then this version should be a little faster: the if word in line check scans the unmodified (shorter) line instead of the already-lengthened one.
rules = []
for word in dict_keys:
    rules.append((word, re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

for line in corpus_input:
    temp_line = line
    for word, regexp, new_word in rules:
        if word in line:
            temp_line = regexp.sub(new_word, temp_line)
    corpus_out.write(temp_line)
4.) If you really always replace with word + "_lorem_ipsum", you can combine the regular expressions into one.
regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

for line in corpus_input:
    # r"\1" is a raw string, so \1 is a backreference and not the character \x01
    line = regexp.sub(r"\1_lorem_ipsum", line)
    corpus_out.write(line)
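One caveat (my addition, not part of the original code): if any of the words may contain regex metacharacters such as "." or "+", escape them when building the combined pattern, for example:

# escape each word so metacharacters are matched literally
regexp = re.compile(r'\b(' + "|".join(re.escape(word) for word in dict_keys) + r')\b')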
4.a) depending on the word distribution this might be faster:
regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

for line in corpus_input:
    if any(word in line for word in dict_keys):
        line = regexp.sub(r"\1_lorem_ipsum", line)
    corpus_out.write(line)
Whether this is more efficient or not probably depends on the number of words to search and replace and on the frequency of those words. I don't have that data.
For 5 words and my distribution it was slower than 3.a.
5.) If the replacement strings differ per word, you can still combine the regexps into one and use a function to perform the replacement:
replace_table = {
    "word1": "word1_laram_apsam",
    "word2": "word2_lerem_epsem",
    "word3": "word3_lorom_opsom",
}

def repl(match):
    return replace_table[match.group(1)]

regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

for line in corpus_input:
    line = regexp.sub(repl, line)
    corpus_out.write(line)
This is slower than 4.); whether it is better than 3.a depends on the number of words and on the word distribution / frequency.
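For reference, here is a sketch that combines the bigger buffers from 2.) with the single combined regexp from 4.); as in the question, corpus_in, corpus_out and dict_keys are assumed to be defined already:

import io
import re

BUF_SIZE = 50 * io.DEFAULT_BUFFER_SIZE

# one pre-compiled, escaped pattern; every match is replaced by itself + "_lorem_ipsum"
regexp = re.compile(r'\b(' + "|".join(re.escape(w) for w in dict_keys) + r')\b')

with open(corpus_in, "rt", BUF_SIZE) as fin, open(corpus_out, "wt", BUF_SIZE) as fout:
    for line in fin:
        fout.write(regexp.sub(r"\1_lorem_ipsum", line))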