Fastest Way to Read Large Binary Files (More than 500 MB)?

I want to read a large binary file and split it into chunks of 6 bytes. For example, I can currently read a 1 GB binary file in 82 seconds, but that is far too slow. What's the best way to reach maximum speed?

Note that I cannot use struct, because my chosen chunk size is not a power of 2 (it is 6 bytes).

from time import time

with open(file, "rb") as infile:
    data_arr = []
    start = time()
    while True:
        data = infile.read(6)
        if not data: break
        data_arr.append(data)
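
As an aside on the struct remark above: struct format codes are not limited to power-of-two sizes; a 6-byte record can be described with the "6s" format code and consumed with struct.iter_unpack, as long as the buffer length is an exact multiple of 6. A minimal sketch, assuming the whole file fits in memory:

import struct

def read_struct(filename):
    # "6s" is a fixed 6-byte bytes field; struct.calcsize("6s") == 6, so
    # iter_unpack() walks the whole buffer in 6-byte steps. It raises
    # struct.error if the buffer size is not an exact multiple of 6.
    with open(filename, "rb") as infile:
        data = infile.read()
    return [fields[0] for fields in struct.iter_unpack("6s", data)]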

You've got a few different options. Your main problem is that, with the small size of your chunks (6 bytes), a lot of overhead is spent fetching each chunk and garbage-collecting.

There are two main ways to get around that:

  1. Load the entire file into memory, THEN separate it into chunks. This is the fastest method, but the larger your file, the more likely it is that you will start running into MemoryErrors.

  2. Load one chunk at a time into memory, process it, then move on to the next chunk. This is no faster overall, but saves time up front since you don't need to wait for the entire file to load before you start processing.

  3. Experiment with combinations of 1. and 2. (buffering the file in large chunks and separating them into smaller chunks, loading the file in multiples of your chunk size, etc.). This is left as an exercise for the reader, as it takes a fair amount of experimentation to arrive at code that works quickly and correctly (see the sketch just below).
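
As a starting point for that exercise, here is a minimal sketch of option 3, assuming an arbitrary, untuned read size of 6 × 10000 bytes per call:

def read_buffered(filename, chunks_per_read=10000):
    # read large blocks whose size is a multiple of 6, so no 6-byte chunk
    # ever straddles two reads, then slice each block into 6-byte pieces
    block_size = 6 * chunks_per_read
    with open(filename, "rb") as infile:
        while True:
            block = infile.read(block_size)
            if not block:
                break
            for i in range(0, len(block), 6):
                yield block[i:i+6]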

Some code, with time comparisons:

import timeit


def read_original(filename):
    with open(filename, "rb") as infile:
        data_arr = []
        while True:
            data = infile.read(6)
            if not data:
                break
            data_arr.append(data)
    return data_arr


# the bigger the file, the more likely this is to cause python to crash
def read_better(filename):
    with open(filename, "rb") as infile:
        # read everything into memory at once
        data = infile.read()
        # separate string into 6-byte chunks
        data_arr = [data[i:i+6] for i in range(0, len(data), 6)]
    return data_arr

# no faster than the original, but allows you to work on each piece without loading the whole into memory
def read_iter(filename):
    with open(filename, "rb") as infile:
        data = infile.read(6)
        while data:
            yield data
            data = infile.read(6)


def main():
    # 93.8688215 s
    tm = timeit.timeit(stmt="read_original('test/oraociei12.dll')", setup="from __main__ import read_original", number=10)
    print(tm)
    # 85.69337399999999 s
    tm = timeit.timeit(stmt="read_better('test/oraociei12.dll')", setup="from __main__ import read_better", number=10)
    print(tm)
    # 103.0508528 s
    tm = timeit.timeit(stmt="[x for x in read_iter('test/oraociei12.dll')]", setup="from __main__ import read_iter", number=10)
    print(tm)

if __name__ == '__main__':
    main()

This way is much faster.

import sys
from functools import partial

SIX = 6
MULTIPLIER = 30000
SIX_COUNT = SIX * MULTIPLIER

def do(data):
    # read the file in blocks of SIX_COUNT (180,000) bytes -- a multiple of 6,
    # so no 6-byte chunk ever straddles two reads -- then slice each block
    # into 6-byte pieces
    for chunk in iter(partial(data.read, SIX_COUNT), b""):
        six_list = [chunk[i:i+SIX] for i in range(0, len(chunk), SIX)]


if __name__ == "__main__":
    args = sys.argv[1:]
    for arg in args:
        with open(arg, 'rb') as data:
            do(data)
