Removing a small number of lines from a large file

I have a very large text file, where most of the lines are composed of ASCII characters, but a small fraction of lines have non-ASCII characters. What is the fastest way to create a new text file containing only the ASCII lines? Right now I am checking each character in each line to see if it's ASCII, and writing each line to the new file if all the characters are ASCII, but this method is rather slow. Also, I am using Python, but would be open to using other languages in the future.

Edit: updated with code

#!/usr/bin/python

def isAscii(s):
    # Check every character; ASCII code points are 0-127.
    for c in s:
        if ord(c) > 127:
            return False
    return True

f = open('data.tsv')
g = open('data-ASCII-only.tsv', 'w')

# Copy a line only if all of its characters are ASCII.
for line in f:
    if isAscii(line):
        g.write(line)

f.close()
g.close()

You can use grep: -v inverts the match (keeping only the lines that do not match), -P enables Perl regex syntax, and [\x80-\xFF] is the byte range for non-ASCII characters.

grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv

See the question "How do I grep for all non-ASCII characters in UNIX" for more on searching for non-ASCII characters with grep.
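If you would rather stay in Python, the same idea can be sketched there: read the file as bytes and drop any line containing a byte in the 0x80-0xFF range, letting the regex engine do the scanning instead of a per-character loop. This is only a sketch, assuming Python 3 and an encoding such as UTF-8 or Latin-1 where non-ASCII characters always produce high bytes; the function name filter_ascii_lines is made up for illustration.

```python
import re

# Matches any byte with the high bit set (0x80-0xFF), i.e. any non-ASCII byte.
# Assumption: the file's encoding maps non-ASCII characters to such bytes.
NON_ASCII = re.compile(rb"[\x80-\xff]")

def filter_ascii_lines(in_path, out_path):
    """Copy only the lines with no non-ASCII bytes; return how many were kept."""
    kept = 0
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            if NON_ASCII.search(line) is None:
                dst.write(line)
                kept += 1
    return kept
```

For the files in the question, this would be called as filter_ascii_lines('data.tsv', 'data-ASCII-only.tsv').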

The following suggestion uses a command-line filter (i.e., you run it on the shell command line). This example works in a shell on Linux or Unix systems, and probably on OS X too (I've heard OS X is BSD-ish):

$ tr -dc '\000-\177' < big_file > big_file_ascii_only

It uses the "tr" (translate) filter. Here -d deletes characters and -c complements the set, so we are telling tr to delete every character outside the range octal 000 to octal 177. Note that this strips the non-ASCII characters and keeps the rest of each line, rather than dropping the whole line. You may wish to tweak the character set; check the man page for tr for other ways to specify the characters you want to keep (or delete).
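For comparison, a rough Python counterpart of the tr command is bytes.translate with a delete set. This is a sketch assuming Python 3; strip_non_ascii and strip_file are hypothetical helper names, and the 1 MiB chunk size is an arbitrary choice. Like tr, it removes the offending bytes and leaves the rest of each line in place.

```python
# All 128 byte values with the high bit set; these are the ones to delete.
HIGH_BYTES = bytes(range(0x80, 0x100))

def strip_non_ascii(data):
    """Return data with every byte outside 0x00-0x7F removed."""
    # table=None means "no remapping"; the second argument lists bytes to drop.
    return data.translate(None, HIGH_BYTES)

def strip_file(in_path, out_path, chunk_size=1 << 20):
    """Stream in_path to out_path in chunks, dropping non-ASCII bytes."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(strip_non_ascii(chunk))
```

Because the filter works byte by byte, chunk boundaries do not matter, so the file never has to fit in memory.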

The other approaches given will work if, and only if, the file is encoded in such a way that "non-ASCII" is equivalent to "high bit set", such as Latin-1 or UTF-8. Here's a program in Python 3 that will work with any encoding.

#!/usr/bin/env python3

in_fname = "utf16file"
in_encoding = "utf-16"
out_fname = "ascii_lines"
out_encoding = "ascii"

def is_ascii(s):
    # A string is ASCII exactly when it can be encoded as ASCII.
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

# Decode with the input's real encoding; newline="" leaves line endings untouched.
with open(in_fname, "r", encoding=in_encoding, newline="") as f_in, \
     open(out_fname, "w", encoding=out_encoding, newline="") as f_out:
    for s in f_in:
        if is_ascii(s):
            f_out.write(s)
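
To see why byte-level filters depend on the encoding, here is a small check, again only a sketch (has_high_bit is a hypothetical helper, and the sample character is my own choice). The same non-ASCII character produces high bytes in UTF-8 but none in UTF-16, so a "high bit set" test would wrongly treat the UTF-16 bytes as ASCII.

```python
def has_high_bit(data):
    """True if any byte in data is 0x80 or above."""
    return any(b >= 0x80 for b in data)

ch = "\u4e2d"  # a CJK character, clearly not ASCII

print(has_high_bit(ch.encode("utf-8")))      # True:  bytes E4 B8 AD
print(has_high_bit(ch.encode("utf-16-le")))  # False: bytes 2D 4E
```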
