
Remove all characters except printable ASCII and Chinese characters in a large text file

I have a 2GB text file, and I would like to clean it so that it contains only printable ASCII characters and Chinese characters (about 10,000 distinct characters).

I tried both versions of the code below, but both are very slow. Any suggestions would be appreciated.

import string

# Attempt 1: filter with a generator expression
all_chi_char = open(chinese_file,'r',encoding='UTF-8').read()
include = set(string.printable+all_chi_char)

full_text = open(source_file,'r',encoding='UTF-8').read()
output_text = ''.join(ch for ch in full_text if ch in include)

# Attempt 2: build the output one character at a time
all_chi_char = open(chinese_file,'r',encoding='UTF-8').read()
include = set(string.printable+all_chi_char)

full_text = open(source_file,'r',encoding='UTF-8').read()
output_text = ''
for ch in full_text:
    if ch in include:
        output_text += ch

First off, are you really sure this is the correct thing to do? Way too often, we see people attempt to heuristically clean up their data with random ideas of how to remove cruft rather than fix the problem at the source. Is there perhaps a way to remove the stuff you don't want earlier in the process, or can you at least explain to us why your data contains things you don't want it to contain?

The problem with your current approach is that you load the entire text file into memory at once for no good reason. Python probably cannot have all 2GB (plus whatever it requires for its own code and runtime state) in resident memory at once, so the OS swaps out memory regions to disk, only to swap them back in again, repeatedly.

Do you need to have the entire resulting text in memory eventually? If not, just read and write one line at a time, then reuse that memory for the next line of text.

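import string

# Build the set of characters to keep: printable ASCII plus the Chinese characters.
with open(chinese_file,'r',encoding='UTF-8') as all_chi_char:
    include = set(string.printable+all_chi_char.read())

# Stream the input line by line so only one line is held in memory at a time.
# The output file is opened as UTF-8 too, so Chinese characters can always be written.
with open(source_file,'r',encoding='UTF-8') as inp, open(dest_file,'w',encoding='UTF-8') as outp:
    for line in inp:
        out_line = []
        for ch in line:
            if ch in include:
                out_line.append(ch)
        outp.write(''.join(out_line))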
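
If the file contains very long lines (or hardly any newlines at all), iterating by line may still pull large pieces into memory. A minimal sketch of the same idea using fixed-size chunks follows; the function name `filter_file` and the `chunk_chars` parameter are made up for illustration and are not part of the original code.

```python
import string


def filter_file(source_file, dest_file, include, chunk_chars=1 << 20):
    """Copy source_file to dest_file, keeping only characters found in include.

    Hypothetical helper: reads chunk_chars characters at a time, so memory
    use stays bounded regardless of how long the lines are.
    """
    with open(source_file, 'r', encoding='UTF-8') as inp, \
         open(dest_file, 'w', encoding='UTF-8') as outp:
        while True:
            # In text mode, read(n) returns up to n decoded characters,
            # so multi-byte UTF-8 sequences are never split.
            chunk = inp.read(chunk_chars)
            if not chunk:
                break
            outp.write(''.join(ch for ch in chunk if ch in include))
```

It would be called as `filter_file(source_file, dest_file, include)` with the same `include` set as above.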

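The line-by-line loop could still be improved by using str.translate() with a translation table (in Python 3, str.maketrans() replaces the old string.maketrans()) instead of testing each character against a homegrown set, but I'm guessing this will already solve the performance problem.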
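
For what it's worth, here is a rough sketch of the translate-based variant. A table from str.maketrans() can only list characters to delete, which is impractical when the goal is to drop everything outside a whitelist, so this sketch assumes a small dict subclass instead; `KeepTable` and `build_keep_table` are names invented for this example. str.translate() deletes any character whose table lookup returns None.

```python
import string


class KeepTable(dict):
    """Translation table that deletes every code point not stored in it."""

    def __missing__(self, key):
        # str.translate() removes characters that map to None.
        return None


def build_keep_table(allowed):
    # Map each allowed character's code point to itself so it is kept.
    table = KeepTable()
    for ch in allowed:
        table[ord(ch)] = ord(ch)
    return table


# Example with a tiny whitelist; in practice, pass string.printable plus
# the contents of the Chinese character file.
table = build_keep_table(string.printable + "中文")
cleaned = "a中b\u00e9文\u3042c".translate(table)
```

Inside the loop, `outp.write(line.translate(table))` would then replace the inner per-character loop. Whether this is actually faster on your data is something you would need to measure.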

