
Remove all characters except ASCII printable and Chinese characters in a large text file

I have a 2 GB text file, and I would like to clean this file so that it includes only ASCII printable characters and Chinese characters (about 10,000 distinct characters).

I have tried both pieces of code below, but both of them are very slow. Any suggestions would be appreciated.

import string

# Attempt 1: filter with a generator expression and str.join()
all_chi_char = open(chinese_file, 'r', encoding='UTF-8').read()
include = set(string.printable + all_chi_char)

full_text = open(source_file, 'r', encoding='UTF-8').read()
output_text = ''.join(ch for ch in full_text if ch in include)

# Attempt 2: filter with an explicit loop and repeated string concatenation
all_chi_char = open(chinese_file, 'r', encoding='UTF-8').read()
include = set(string.printable + all_chi_char)

full_text = open(source_file, 'r', encoding='UTF-8').read()
output_text = ''
for ch in full_text:
    if ch in include:
        output_text += ch

First off, are you really sure this is the correct thing to do? Way too often, we see people attempt to heuristically clean up their data with ad-hoc ideas about how to remove cruft, rather than fixing the problem at the source. Is there perhaps a way to remove the unwanted content earlier in the process, or can you at least explain why your data contains things you don't want it to contain?

The problem with your current approach is that you load the entire text file into memory at once for no good reason. Python probably cannot hold all 2 GB (plus whatever it requires for its own code and runtime state) in resident memory at once, so the OS repeatedly swaps memory regions out to disk, only to swap them back in again.

Do you actually need the entire resulting text in memory eventually? If not, just read and write one line at a time, then reuse that memory for the next line of text.

import string

# Build the set of allowed characters once, up front.
with open(chinese_file, 'r', encoding='UTF-8') as all_chi_char:
    include = set(string.printable + all_chi_char.read())

# Stream the input one line at a time instead of loading 2 GB at once;
# note the explicit encoding on the output file, too.
with open(source_file, 'r', encoding='UTF-8') as inp, \
        open(dest_file, 'w', encoding='UTF-8') as outp:
    for line in inp:
        # Collect kept characters in a list and join once per line;
        # this avoids quadratic repeated string concatenation.
        out_line = []
        for ch in line:
            if ch in include:
                out_line.append(ch)
        outp.write(''.join(out_line))

This could still be improved by using str.translate() with a translation table (string.maketrans() no longer exists in Python 3; the replacement is str.maketrans()) instead of a homegrown set of characters, but I'm guessing the above will already solve the performance problem.
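As a minimal sketch of that translate() idea, reusing the same hypothetical chinese_file / source_file / dest_file names as above: str.translate() deletes any character whose table entry is None, so a dict subclass whose __missing__ returns None drops every character that was not explicitly kept, and the per-character scan of each line runs in C rather than in Python bytecode.

import string

class KeepOnly(dict):
    # Translation table: any code point not present maps to None,
    # which str.translate() interprets as "delete this character".
    def __missing__(self, codepoint):
        return None

with open(chinese_file, 'r', encoding='UTF-8') as f:
    # Map each allowed character's code point to the character itself.
    table = KeepOnly((ord(ch), ch) for ch in string.printable + f.read())

with open(source_file, 'r', encoding='UTF-8') as inp, \
        open(dest_file, 'w', encoding='UTF-8') as outp:
    for line in inp:
        outp.write(line.translate(table))

Whether this actually beats the set-based filter depends on the data: kept characters are handled entirely in C, but every deleted character still triggers a Python-level __missing__ call, so it is worth profiling both versions on a sample of the real file.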
