Python how can I read huge binary file (>25GB)?

I have N-body simulation data and have to read that file in Python.

Its size is over 25GB, so file.read() does not work due to lack of memory.

So I wrote code like this:

with open("fullFoF_merger.cbin.z0.Run1", "rb") as mergertree:
    def param(data):
        result = {"nowhid":data[0], "nexthid":data[2],"zi":data[10], 
                  "zip1":data[11], "F":data[4], "mass":data[9], 
                  "dlnM":data[5],"dM":data[12], "dlnJ":data[6],"dJ":data[13],
                  "dlnspin": data[7], "spin":data[8],
                  "G":data[14], "overden":data[15]}
        return result

    num = 0

    while 1:
        num +=1

        binary_data = mergertree.read(4)

        if not binary_data : break

        n_max = struct.unpack('I', binary_data)


        binary_data = mergertree.read(64*n_max[0])

        Halo = [None]*n_max[0]


        for i in range(1,n_max[0]+1):
            data = struct.unpack("4i12f", binary_data[64*(i-1):64*(i)])
            Halo[i-1] = param(data)

        MergerQ = []+Halo


print(MergerQ)

print(num)

print("\n Run time \n --- %d seconds ---" %(time.time()-start_time))

In this process the while loop runs 45470522 times. But when I print MergerQ in Python, it shows only one dictionary, like this:

[{'nowhid': 53724, 'nexthid': 21912952, 'zi': 0.019874930381774902, 'zip1': -1.6510486602783203e-05, 'F': inf, 'mass': 67336740864.0, 'dlnM': 0.0, 'dM': 0.0, 'dlnJ': 0.1983184665441513, 'dJ': 8463334768640.0, 'dlnspin': 0.19668935239315033, 'spin': 0.012752866372466087, 'G': inf, 'overden': 1.0068886280059814}]

I think it is caused by lack of memory or a memory limit on Python's variables.

How can I solve this problem?

Is there any way to read the whole data and save it in Python variables?

Could parallel computing be the solution for this code?

I will be waiting for your comments. Thank you.

This line is your problem:

MergerQ = []+Halo

You clear MergerQ on every iteration; initialize it outside of your loop instead:

num = 0
MergerQ = []

while 1:
    ...
    MergerQ += Halo
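Put together, the corrected reading loop would look something like this (a sketch assembled from the question's own code; the only change is appending each parsed record to MergerQ instead of rebuilding it every block):

num = 0
MergerQ = []

while 1:
    num += 1
    binary_data = mergertree.read(4)
    if not binary_data:
        break
    n_max = struct.unpack('I', binary_data)
    binary_data = mergertree.read(64 * n_max[0])
    for i in range(n_max[0]):
        data = struct.unpack("4i12f", binary_data[64 * i:64 * (i + 1)])
        MergerQ.append(param(data))  # accumulate across blocks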

But don't expect to have the amount of memory you need to store the entire thing if your file is that big; you'll need a lot of memory and a lot of time.
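If you really do need everything in memory at once, a structured NumPy array is far more compact than tens of millions of Python dicts. A minimal sketch, assuming the same 64-byte record layout of 4 int32 followed by 12 float32 implied by the "4i12f" format string, and assuming little-endian byte order:

import struct
import numpy as np

# One 64-byte record: 4 int32 fields then 12 float32 fields,
# matching struct format "4i12f" (byte order assumed little-endian).
halo_dtype = np.dtype([("ints", "<i4", 4), ("floats", "<f4", 12)])

blocks = []
with open("fullFoF_merger.cbin.z0.Run1", "rb") as f:
    while True:
        header = f.read(4)
        if not header:
            break
        (n,) = struct.unpack("I", header)  # 4-byte record count per block
        blocks.append(np.fromfile(f, dtype=halo_dtype, count=n))

halos = np.concatenate(blocks)
print(halos["floats"][:, 5])  # e.g. the column that param() calls "mass"

This still needs roughly as much RAM as the file itself (about 25GB here), but it avoids the much larger per-dict and per-float object overhead of the original approach.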

Edit

It's very possible that you'll be able to run your code successfully with less physical RAM than would otherwise be needed, as your OS will likely page the data out to your hard disk and fetch it when needed, but this will massively increase run time.

Try running this code snippet and see what happens (forewarning: if you leave it running too long, your machine will become unresponsive and will most likely need a physical reset):

a = []
while 1:
    a = [a, a]

Expect your script to react similarly.
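On the other hand, if each record only needs to be seen once (for sums, histograms, or filtering), a generator keeps memory usage flat no matter how large the file is. A minimal sketch reusing the question's "4i12f" format string; iter_halos is a hypothetical helper name:

import struct

def iter_halos(path):
    # Yield one unpacked 64-byte ("4i12f") record at a time.
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return
            (n,) = struct.unpack("I", header)
            block = f.read(64 * n)
            for i in range(n):
                yield struct.unpack("4i12f", block[64 * i:64 * (i + 1)])

# Example: count records without materialising the full list.
print(sum(1 for _ in iter_halos("fullFoF_merger.cbin.z0.Run1")))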
