
Fastest Way to Read Large Binary Files (More than 500 MB)?

I want to read large binary files and split them into chunks of 6 bytes. For example, right now I can read a 1 GB binary file in 82 seconds, which is far too slow. What is the best way to reach maximum speed?

Note that I cannot use struct, because my selected chunk size is not a power of 2 (it is 6 bytes).

from time import time

with open(file, "rb") as infile:
    data_arr = []
    start = time()
    while True:
        data = infile.read(6)
        if not data:
            break
        data_arr.append(data)

You've got a few different options. Your main problem is that, with chunks as small as 6 bytes, a lot of overhead goes into fetching each chunk and creating (and later garbage-collecting) millions of tiny bytes objects.

There are two main ways to get around that, plus a third that combines them:

  1. Load the entire file into memory, THEN separate it into chunks. This is the fastest method, but the larger your file, the more likely it is that you will start running into MemoryError.

  2. Load one chunk at a time into memory, process it, then move on to the next chunk. This is no faster overall, but saves time up front since you don't need to wait for the entire file to be loaded to start processing.

  3. Experiment with combinations of 1 and 2 (buffering the file in large chunks and separating those into smaller chunks, loading the file in multiples of your chunk size, etc.). This is mostly left as an exercise for the reader, since it takes a fair amount of experimentation to reach code that is both fast and correct; a rough sketch follows this list.
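
As a starting point for option 3, here is a rough, untuned sketch. It reads the file in large buffers (BUFFER_CHUNKS is an arbitrary value picked for illustration, not a recommended setting) and slices each buffer into 6-byte pieces:

CHUNK_SIZE = 6
BUFFER_CHUNKS = 100000                      # arbitrary; tune for your data
BUFFER_SIZE = CHUNK_SIZE * BUFFER_CHUNKS    # read 600,000 bytes per call

def read_buffered(filename):
    with open(filename, "rb") as infile:
        while True:
            # one large read instead of one read per 6-byte chunk
            buffer = infile.read(BUFFER_SIZE)
            if not buffer:
                break
            # slice the buffer into 6-byte chunks in memory
            for i in range(0, len(buffer), CHUNK_SIZE):
                yield buffer[i:i + CHUNK_SIZE]

The best buffer size depends on your disk and memory, so it is worth timing a few different values.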

Some code, with time comparisons:

import timeit


def read_original(filename):
    with open(filename, "rb") as infile:
        data_arr = []
        while True:
            data = infile.read(6)
            if not data:
                break
            data_arr.append(data)
    return data_arr


# the bigger the file, the more likely this is to cause python to crash
def read_better(filename):
    with open(filename, "rb") as infile:
        # read everything into memory at once
        data = infile.read()
        # separate string into 6-byte chunks
        data_arr = [data[i:i+6] for i in range(0, len(data), 6)]
    return data_arr

# no faster than the original, but allows you to work on each piece without loading the whole into memory
def read_iter(filename):
    with open(filename, "rb") as infile:
        data = infile.read(6)
        while data:
            yield data
            data = infile.read(6)


def main():
    # 93.8688215 s
    tm = timeit.timeit(stmt="read_original('test/oraociei12.dll')", setup="from __main__ import read_original", number=10)
    print(tm)
    # 85.69337399999999 s
    tm = timeit.timeit(stmt="read_better('test/oraociei12.dll')", setup="from __main__ import read_better", number=10)
    print(tm)
    # 103.0508528 s
    tm = timeit.timeit(stmt="[x for x in read_iter('test/oraociei12.dll')]", setup="from __main__ import read_iter", number=10)
    print(tm)

if __name__ == '__main__':
    main()
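
Since read_iter is a generator, you can process each 6-byte chunk as it arrives instead of collecting everything into a list (the list comprehension in the timing code above exists only to force the whole file to be read). A minimal usage sketch, where process_chunk is a hypothetical stand-in for whatever you actually do with each chunk:

def process_chunk(chunk):
    # placeholder: replace with your real per-chunk work
    pass

for chunk in read_iter('test/oraociei12.dll'):
    process_chunk(chunk)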

This way is much faster: it reads the file in 180,000-byte buffers (SIX_COUNT) and slices each buffer into 6-byte chunks in memory, instead of paying the overhead of one read call per 6 bytes.

import sys
from functools import partial

SIX = 6
MULTIPLIER = 30000
SIX_COUNT = SIX * MULTIPLIER  # 180,000 bytes read per call

def do(data):
    # read the file in SIX_COUNT-byte buffers until read() returns b""
    for chunk in iter(partial(data.read, SIX_COUNT), b""):
        # slice each buffer into 6-byte chunks
        six_list = [chunk[i:i+SIX] for i in range(0, len(chunk), SIX)]

if __name__ == "__main__":
    args = sys.argv[1:]
    for arg in args:
        with open(arg, 'rb') as data:
            do(data)
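
Note that the six_list built inside do() is thrown away on every iteration; it exists only to time the splitting. If you actually need the chunks, one option (a sketch, not the original author's code, reusing SIX, SIX_COUNT, sys, and partial from the listing above) is to yield them instead:

def do_iter(data):
    # same buffered read as do(), but hands each 6-byte chunk to the caller
    for chunk in iter(partial(data.read, SIX_COUNT), b""):
        for i in range(0, len(chunk), SIX):
            yield chunk[i:i + SIX]

# example: count the chunks in each file named on the command line
for arg in sys.argv[1:]:
    with open(arg, 'rb') as data:
        print(arg, sum(1 for _ in do_iter(data)))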
