
Replacing a set of words in a large text file

I have a large txt file (around 20 GB) and I want to replace all instances of a list of words in this large file. I am struggling to find a way to optimize this code, which leads to me processing this file for a very long time.

What could I improve?

    corpus_input = open(corpus_in, "rt")
    corpus_out = open(corpus_out, "wt")
    for line in corpus_input:
        temp_str = line
        for word in dict_keys:
            if word in line:
                new_word = word + "_lauren_ipsum"
                temp_str = re.sub(fr'\b{word}\b', new_word, temp_str)
            else:
                continue
        corpus_out.writelines(temp_str)
    corpus_input.close()
    corpus_out.close()

The most important thing for optimisation is to understand what exactly is performing poorly. Then you can see what can be optimized.

If, for example, reading and writing take 99% of the time, it's not really worth optimizing the processing of your data: even if you could speed up the processing by a factor of 10, you would only gain about 0.9% overall.

I suggest measuring and comparing a few versions and posting the performance differences. This might lead to further suggestions for optimization.
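As a minimal sketch of how such a comparison could look (the `run_variant` helper and `copy_only` baseline are illustrative names, not part of the original code):

```python
import time

def run_variant(process, corpus_in, corpus_out):
    """Time one processing variant; `process` receives the two open files."""
    start = time.perf_counter()
    with open(corpus_in, "rt") as src, open(corpus_out, "wt") as dst:
        process(src, dst)
    return time.perf_counter() - start

def copy_only(src, dst):
    """Baseline: just copy, to measure the pure read/write cost."""
    for line in src:
        dst.write(line)

# elapsed = run_variant(copy_only, "corpus.txt", "corpus_out.txt")
# print(f"copy only: {elapsed:.2f}s")
```

Comparing each variant against the `copy_only` baseline tells you how much time the processing itself adds on top of pure I/O.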

In all the examples below I replaced writelines with write, as writelines is probably decomposing your line character by character prior to writing.

In any case, you want to use write. That alone should already gain you a speedup of about 5.
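To see why: `writelines` expects an iterable of strings, and a Python string is itself an iterable of one-character strings, so passing a single line to `writelines` results in one `write` call per character. A small sketch to make that visible (the `CountingWriter` class is just for illustration):

```python
import io

class CountingWriter(io.StringIO):
    """A StringIO that counts how often write() is called."""
    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, s):
        self.write_calls += 1
        return super().write(s)

buf = CountingWriter()
buf.writelines("hello\n")   # iterates the string: one write per character
print(buf.write_calls)      # 6

buf2 = CountingWriter()
buf2.write("hello\n")       # a single write call
print(buf2.write_calls)     # 1
```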

1.) Just reading and writing

    with open(corpus_in, "rt") as corpus_input, open(corpus_out, "wt") as corpus_out:
        for line in corpus_input:
            corpus_out.write(line)

2.) Just reading and writing with a bigger buffer

    import io

    BUF_SIZE = 50 * io.DEFAULT_BUFFER_SIZE  # try other buffer sizes if you see an impact

    with open(corpus_in, "rt", BUF_SIZE) as corpus_input, open(corpus_out, "wt", BUF_SIZE) as corpus_out:
        for line in corpus_input:
            corpus_out.write(line)

For me this increases performance by a few percent.

3.) Move search regexp and replacement generation out of the loop.

   rules = []
   for word in dict_keys:
       rules.append((re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

   for line in corpus_input:
       for regexp, new_word in rules: 
           line = regexp.sub(new_word, line)
       corpus_out.write(line)

On my machine, with my frequency of lines containing words, this solution is in fact slower than the one with the line if word in line.

So perhaps try: 3.a) Move search regexp and replacement generation out of the loop, but keep the membership check.

   rules = []
   for word in dict_keys:
       rules.append((word, re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

   for line in corpus_input:
       for word, regexp, new_word in rules: 
           if word in line:
               line = regexp.sub(new_word, line)
       corpus_out.write(line)

3.b) If all replacement strings are longer than the initial strings, then this would be a little faster.

   rules = []
   for word in dict_keys:
       rules.append((word, re.compile(fr'\b{word}\b'), word + "_lorem_ipsum"))

   for line in corpus_input:
       temp_line = line
       for word, regexp, new_word in rules: 
           if word in line:
               temp_line = regexp.sub(new_word, temp_line)
       corpus_out.write(temp_line)

4.) If you really always replace with word + "_lorem_ipsum", combine the regular expressions into one.

   regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

   for line in corpus_input:
        line = regexp.sub(r"\1_lorem_ipsum", line)
       corpus_out.write(line)

4.a) Depending on the word distribution this might be faster:

   regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

   for line in corpus_input:
       if any(word in line for word in dict_keys):
            line = regexp.sub(r"\1_lorem_ipsum", line)
       corpus_out.write(line)

Whether this is more efficient or not probably depends on the number of words to search and replace and on the frequency of those words. I don't have that data.

For 5 words and my distribution this is slower than 3.a.

5.) If the words to replace are different, you might still try to combine the regexps into one and use a function as the replacement.

   replace_table = {
      "word1": "word1_laram_apsam",
      "word2": "word2_lerem_epsem",
      "word3": "word3_lorom_opsom",
   }

   def repl(match):
      return replace_table[match.group(1)]

   regexp = re.compile(fr'\b({"|".join(dict_keys)})\b')

   for line in corpus_input:
       line = regexp.sub(repl, line)
       corpus_out.write(line)

This is slower than 4; whether it is better than 3.a depends on the number of words and on the word distribution / frequency.
