Efficient way to parse through huge file

I have to parse through a really big file, modify its contents, and write that to another file. The file that I have right now is not that big in comparison to what it could be, but it's big nonetheless.

The file is 1.3 GB and contains about 7 million lines of this format:

8823192\t/home/pcastr/...

Where \t is a tab character. The number at the beginning is the apparent size of the path that follows.

I want an output file with lines looking like this (in csv format):

True,8823192,/home/pcastr/...

Where the first value is whether the path is a directory.

Currently, my code looks something like this:

import os

with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # an empty string signals the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)

        # streamed_file is the already-open output file
        streamed_file.write("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path))

A caveat with this is that files like this WILL get tremendously big, so I need a solution that is not only fast but also memory-efficient. I know that there is usually a trade-off between these two qualities.

The biggest gain is likely to come from calling split only once per line:

size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore

You can at least simplify your code by treating the input file as an iterator and using the csv module. This might give you a speed-up as well, as it eliminates the need for an explicit call to split:

import csv
import os

with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)  # streamed_file is the open output file
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])

You might need mmap. There is an introduction and tutorial in the Python documentation.

Put simply, it lets you treat a file on disk as if it were in RAM, without actually reading the whole file into RAM.
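A minimal sketch of that idea, assuming the same tab-separated layout as above; note that a memory-mapped file behaves like a bytes object, so the fields come back as bytes:

import mmap
import os

with open(filepath, "rb") as f:
    # Map the file read-only; the OS pages data in on demand instead of
    # reading the whole 1.3 GB up front.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b""):
            size, path = line.rstrip(b"\n").split(b"\t")
            is_dir = os.path.isdir(path)  # os.path functions accept bytes paths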

Compressing the file before copying it over the network could speed up the processing, because your data reaches the script faster.

Can you keep the input text file compressed on the remote target system? If yes, you could compress it using an algorithm that Python supports (the zlib, gzip, bz2, lzma, or zipfile modules).

If not, you could at least run a script on the remote storage system to compress the file. You would then read the file, decompress it in memory using one of those modules, and process each line.
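For instance, the gzip module can decompress and iterate line by line as a stream, so the whole file never sits in memory at once (a minimal sketch; input.tsv.gz is a hypothetical filename, and the output handling is a placeholder):

import gzip
import os

# Stream a gzip-compressed input without decompressing it to disk first.
with gzip.open("input.tsv.gz", "rt", encoding="utf-8") as f:
    for line in f:
        size, path = line.rstrip("\n").split("\t")
        print(os.path.isdir(path), size, path, sep=",")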
