Reading part of a large binary file in Python

Question

I have a large binary file (about 2.5 GB). It contains a header (336 bytes) followed by seismic signal data (x, y and z channels) stored as int32. The signal has 223,200,000 samples. I need to read part of the signal, for example the samples in the interval [216,000,000, 219,599,999]. I wrote this function:

import numpy as np

def reading(path, start_moment, end_moment):
    file_data = open(path, 'rb')
    if start_moment is not None:
        bytes_value = start_moment * 4 * 3
        file_data.seek(336 + bytes_value)
    else:
        file_data.seek(336)

    if end_moment is None:
        try:
            signals = np.fromfile(file_data, dtype=np.int32)
        except MemoryError:
            return None
        finally:
            file_data.close()
    else:
        moment_count = end_moment - start_moment + 1
        try:
            signals = np.fromfile(file_data, dtype=np.int32,
                                  count=moment_count * 3)
        except MemoryError:
            return None
        finally:
            file_data.close()
    channel_count = 3
    signal_count = signals.shape[0] // channel_count
    signals = np.reshape(signals, newshape=(signal_count, channel_count))
    return signals

If I run a script that calls this function in the PyCharm IDE, I get an error:

Traceback (most recent call last):
  File "D:/AppsBuilding/test/testReadBaikal8.py", line 41, in <module>
    signal_2 = reading(path=path, start_moment=216000000, end_moment=219599999)
  File "D:/AppsBuilding/test/testReadBaikal8.py", line 27, in reading
    count=moment_count * 3)
OSError: obtaining file position failed

But if I run the script with start_moment=7200000 and end_moment=10799999, everything works. My PC runs 32-bit Windows 7 and has 1.95 GB of memory. Please help me resolve this problem.

Answer 1

Divide the file into small segments and free the memory after each segment has been processed:

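def read_in_block(file_path, block_size=1024):
    # Yield the file one block at a time, so that only block_size
    # bytes are held in memory at once.
    with open(file_path, "rb") as f:  # binary mode: the data is raw int32
        while True:
            block = f.read(block_size)
            if not block:
                return
            yield block


# file_path is the path to your data file
for block in read_in_block(file_path):
    print(block)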
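
For the int32 layout described in the question, the same idea might be adapted roughly as follows. This is only a sketch: the helper name `iter_triplet_blocks`, its parameters and the default block length are made up for illustration, while the 336-byte header and the 3 x 4-byte records come from the question. It seeks to the first requested sample and then yields raw bytes in blocks that contain whole triplets.

```python
HEADER_SIZE = 336          # header length in bytes, as stated in the question
RECORD_SIZE = 3 * 4        # one sample: x, y, z as 4-byte int32 values


def iter_triplet_blocks(path, start, end, records_per_block=4096):
    """Yield raw bytes for samples start..end (inclusive), block by block.

    Hypothetical helper: every yielded block holds a whole number of
    12-byte records, except possibly the last one if the file is truncated.
    """
    remaining = end - start + 1
    with open(path, "rb") as f:
        f.seek(HEADER_SIZE + start * RECORD_SIZE)
        while remaining > 0:
            want = min(remaining, records_per_block)
            block = f.read(want * RECORD_SIZE)
            if not block:
                return  # reached the end of the file early
            yield block
            remaining -= len(block) // RECORD_SIZE
```

Each complete block could then be handed to something like `np.frombuffer(block, dtype=np.int32).reshape(-1, 3)` and processed before the next one is read, so memory use stays bounded by the block size rather than by the size of the requested range.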

Answer 2

I don't use NumPy, but I don't see anything obviously wrong with your code. However, you say the file is approximately 2.5 GB in size. A triplet index of 219,599,999 requires a file at least 2.45 GB in size:

$ calc
; 219599999 * 4 * 3
    2635199988
; 2635199988 / 1024^3
    ~2.45422123745083808899

Are you sure your file is really that large?
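
The size question can also be checked from Python. The sketch below uses hypothetical helper names (`required_size`, `start_offset`, `file_is_large_enough`) and takes the header size and record layout from the question. It also compares the starting byte offsets of the two requests with the largest signed 32-bit value. The failing request starts beyond that limit and the working one does not, which may be relevant on a 32-bit system, although this thread does not confirm that it is the cause of the OSError.

```python
import os

HEADER_SIZE = 336      # bytes, from the question
RECORD_SIZE = 3 * 4    # three int32 channels per sample
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def required_size(last_index, header_size=HEADER_SIZE):
    """Return the minimum file size needed to hold sample `last_index`."""
    return header_size + (last_index + 1) * RECORD_SIZE


def start_offset(first_index, header_size=HEADER_SIZE):
    """Return the byte offset at which sample `first_index` begins."""
    return header_size + first_index * RECORD_SIZE


def file_is_large_enough(path, last_index):
    """Compare the real file size with the size the request implies."""
    return os.path.getsize(path) >= required_size(last_index)


# Offsets of the failing and the working request from the question,
# each followed by whether it exceeds the signed 32-bit range.
print(start_offset(216_000_000), start_offset(216_000_000) > INT32_MAX)
print(start_offset(7_200_000), start_offset(7_200_000) > INT32_MAX)
```

Calling `file_is_large_enough(path, 219_599_999)` on the real file would answer the question above directly.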

I also don't use MS Windows, but the following toy programs work for me. The first creates a data file that mimics the structure of yours. The second shows that it can read the final data triplet. What happens if you run these on your system?

# Write a 10-byte header followed by 1000 twelve-byte records.
with open('x', 'wb') as fh:
    fh.write(b'0123456789')
    for i in range(0, 1000):
        s = bytes('{:03d}'.format(i), 'ascii')
        fh.write(b'a' + s + b'b' + s + b'c' + s)

Read the data from file x:

# Seek past the header to the last record and read its 12 bytes.
with open('x', 'rb') as fh:
    triplet = 999
    fh.seek(10 + triplet * 3 * 4)
    data = fh.read(3 * 4)
print(data)
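
A variant of the same toy test that is closer to the layout in the question might look like the sketch below. It writes a 336-byte header followed by real int32 triplets and decodes the last one with the standard `struct` module. The helper names, the zero-filled header, the sample values and the little-endian byte order are assumptions made for illustration, not details from the question.

```python
import os
import struct
import tempfile

HEADER_SIZE = 336               # bytes, as in the question
RECORD = struct.Struct("<3i")   # x, y, z as little-endian int32 (assumed byte order)


def write_toy_file(path, sample_count):
    """Write a zero-filled header followed by sample_count (x, y, z) records."""
    with open(path, "wb") as fh:
        fh.write(b"\0" * HEADER_SIZE)
        for i in range(sample_count):
            fh.write(RECORD.pack(i, -i, 2 * i))


def read_triplet(path, index):
    """Seek straight to record `index` and decode its three int32 values."""
    with open(path, "rb") as fh:
        fh.seek(HEADER_SIZE + index * RECORD.size)
        return RECORD.unpack(fh.read(RECORD.size))


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_file(path, 1000)
print(read_triplet(path, 999))
```

Because this path uses only `seek` and `read` on an ordinary file object, running it on the affected machine with a larger sample count could help show whether the problem lies in the file access itself or in how `np.fromfile` handles the file position.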
