
Replacing a set of words in a large text file

I have a large txt file (around 20 GB), and I want to replace all instances of a list of words in this file. I am struggling to find a way to optimize this code, so processing the file takes a very long time.

What could I improve?

    corpus_input = open(corpus_in, "rt")
    corpus_out = open(corpus_out, "wt")
    for line in corpus_input:
        temp_str = line
        for word in dict_keys:
            if word in line:
                new_word = word + "_lauren_ipsum"
                temp_str = re.sub(fr'\b{word}\b', new_word, temp_str)
            else:
                continue
        corpus_out.writelines(temp_str)
    corpus_input.close()
    corpus_out.close()

The most important thing for optimisation is to understand what exactly is performing poorly. Then you can see what can be optimized.

If, for example, reading and writing take 99% of the time, it's not really worth optimizing the processing of your data. Even if you could speed up the processing by a factor of 10, you would only gain 0.9% overall.

I suggest measuring and comparing some versions, and posting the performance differences. This might lead to further suggestions for optimisation.
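To make such a comparison concrete, a minimal timing harness along these lines could be used. The `timed` helper and the sample data are made up for illustration; on the real 20 GB file you would time the actual read/process/write loop instead of an in-memory list:

```python
import re
import time

def timed(label, fn):
    """Run fn once and report the elapsed wall-clock time (a rough benchmark sketch)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Hypothetical sample data standing in for the real corpus.
lines = ["some sample text line\n"] * 100_000
pattern = re.compile(r"\bsample\b")

# Baseline: just touching every line, no processing.
copied, t_copy = timed("copy only", lambda: [line for line in lines])

# Candidate: the regex substitution applied to every line.
substituted, t_sub = timed(
    "regex sub", lambda: [pattern.sub("sample_lorem_ipsum", line) for line in lines]
)
```

Comparing `t_copy` against `t_sub` tells you how much of the total time the substitution itself is responsible for.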

In all the examples below I replaced writelines with write, as writelines is probably decomposing your line character by character prior to writing.

In any case, you want to use write. That alone should already gain a speedup of about 5.
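To see why this matters: writelines expects an iterable of strings, and a Python string is itself an iterable of one-character strings, so passing a single line makes writelines loop over it character by character. A small sketch using io.StringIO (chosen here only so the example needs no files):

```python
import io

line = "the quick brown fox\n"

buf_a = io.StringIO()
buf_a.write(line)        # one call handles the whole string

buf_b = io.StringIO()
buf_b.writelines(line)   # the string is treated as an iterable,
                         # so it is consumed one character at a time

# Both buffers end up with identical content; write just gets there in one step.
print(buf_a.getvalue() == buf_b.getvalue())
```

The output is the same either way, which is why the bug is easy to miss: only the per-character overhead differs.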

1.) Just reading and writing

    with open(corpus_in, "rt") as corpus_input, open(corpus_out, "wt") as corpus_out:
        for line in corpus_input:
            corpus_out.write(line)

2.) Just reading and writing with a bigger buffer

import io

BUF_SIZE = 50 * io.DEFAULT_BUFFER_SIZE # try other buffer sizes if you see an impact
    with open(corpus_in, "rt", BUF_SIZE) as corpus_input, open(corpus_out, "wt", BUF_SIZE) as corpus_out:
        for line in corpus_input:
            corpus_out.write(line)

For me this increases performance by a few percent.

3.) Move the regexp compilation and replacement generation out of the loop.

   rules = []
   for word in dict_keys:
       rules.append((re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

   for line in corpus_input:
       for regexp, new_word in rules: 
           line = regexp.sub(new_word, line)
       corpus_out.write(line)

On my machine, with my frequency of lines containing the words, this solution is in fact slower than the one with the `if word in line` check.

So perhaps try: 3.a) Move the regexp compilation and replacement generation out of the loop, but keep the `if word in line` check.

   rules = []
   for word in dict_keys:
       rules.append((word, re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

   for line in corpus_input:
       for word, regexp, new_word in rules: 
           if word in line:
               line = regexp.sub(new_word, line)
       corpus_out.write(line)

3.b) If all replacement strings are longer than the initial strings, then this would be a little faster.

   rules = []
   for word in dict_keys:
       rules.append((word, re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

   for line in corpus_input:
       temp_line = line
       for word, regexp, new_word in rules: 
           if word in line:
               temp_line = regexp.sub(new_word, temp_line)
       corpus_out.write(temp_line)

4.) If you really always replace with word + "_lorem_ipsum", combine the regular expressions into one.

   regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

   for line in corpus_input:
       line = regexp.sub(r"\1_lorem_ipsum", line)  # raw string, so \1 is a backreference
       corpus_out.write(line)

4.a) depending on the word distribution this might be faster:

   regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

   for line in corpus_input:
       if any(word in line for word in dict_keys):
           line = regexp.sub(r"\1_lorem_ipsum", line)  # raw string, so \1 is a backreference
       corpus_out.write(line)

Whether this is more efficient or not probably depends on the number of words to search and replace and on the frequency of those words. I don't have that data.

For 5 words and my distribution this is slower than 3.a).

5.) If the replacement string differs per word, you can still try to combine the regexps into one and use a function as the replacement.

   replace_table = {
       "word1": "word1_laram_apsam",
       "word2": "word2_lerem_epsem",
       "word3": "word3_lorom_opsom",
   }

   def repl(match):
      return replace_table[match.group(1)]

   regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

   for line in corpus_input:
       line = regexp.sub(repl, line)
       corpus_out.write(line)

This is slower than 4.); whether it is better than 3.a) depends on the number of words and on the word distribution / frequency.
