Python how can I read huge binary file (>25GB)?

I have N-body simulation data and have to read that file in Python.

Its size is over 25GB, so file.read() does not work due to lack of memory.

So I wrote code like this:

with open("fullFoF_merger.cbin.z0.Run1", "rb") as mergertree:
    def param(data):
        result = {"nowhid":data[0], "nexthid":data[2],"zi":data[10], 
                  "zip1":data[11], "F":data[4], "mass":data[9], 
                  "dlnM":data[5],"dM":data[12], "dlnJ":data[6],"dJ":data[13],
                  "dlnspin": data[7], "spin":data[8],
                  "G":data[14], "overden":data[15]}
        return result

    num = 0

    while 1:
        num +=1

        binary_data = mergertree.read(4)

        if not binary_data : break

        n_max = struct.unpack('I', binary_data)


        binary_data = mergertree.read(64*n_max[0])

        Halo = [None]*n_max[0]


        for i in range(1,n_max[0]+1):
            data = struct.unpack("4i12f", binary_data[64*(i-1):64*(i)])
            Halo[i-1] = param(data)

        MergerQ = []+Halo


print(MergerQ)

print(num)

print("\n Run time \n --- %d seconds ---" %(time.time()-start_time))

In this process the while loop runs 45470522 times. But when I print MergerQ in Python, it shows only one dictionary, like this:

[{'nowhid': 53724, 'nexthid': 21912952, 'zi': 0.019874930381774902, 'zip1': -1.6510486602783203e-05, 'F': inf, 'mass': 67336740864.0, 'dlnM': 0.0, 'dM': 0.0, 'dlnJ': 0.1983184665441513, 'dJ': 8463334768640.0, 'dlnspin': 0.19668935239315033, 'spin': 0.012752866372466087, 'G': inf, 'overden': 1.0068886280059814}]

I think it is caused by lack of memory or a memory limit on Python's variables.

How can I solve this problem?

Is there any way to read the whole data and save it in Python variables?

Could parallel computing be the solution for this code?

I will be waiting for your comments. Thank you.

This line is your problem:

MergerQ = []+Halo

You clear MergerQ on every iteration; initialize it outside of your loop instead:

num = 0
MergerQ = []

while 1:
    ...
    MergerQ += Halo
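Put together, the corrected reading loop would look something like this (a sketch assembled from the question's own code; the only change is appending each parsed record to MergerQ instead of rebuilding it every block):

num = 0
MergerQ = []

while 1:
    num += 1
    binary_data = mergertree.read(4)
    if not binary_data:
        break
    n_max = struct.unpack('I', binary_data)
    binary_data = mergertree.read(64 * n_max[0])
    for i in range(n_max[0]):
        data = struct.unpack("4i12f", binary_data[64 * i:64 * (i + 1)])
        MergerQ.append(param(data))  # accumulate across blocks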

But don't expect to have the amount of memory you need to store the entire thing if your file is that big; you'll need a lot of memory and a lot of time.
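If you really do need everything in memory at once, a structured NumPy array is far more compact than tens of millions of Python dicts. A minimal sketch, assuming the same 64-byte record layout of 4 int32 followed by 12 float32 implied by the "4i12f" format string, and assuming little-endian byte order:

import struct
import numpy as np

# One 64-byte record: 4 int32 fields then 12 float32 fields,
# matching struct format "4i12f" (byte order assumed little-endian).
halo_dtype = np.dtype([("ints", "<i4", 4), ("floats", "<f4", 12)])

blocks = []
with open("fullFoF_merger.cbin.z0.Run1", "rb") as f:
    while True:
        header = f.read(4)
        if not header:
            break
        (n,) = struct.unpack("I", header)  # 4-byte record count per block
        blocks.append(np.fromfile(f, dtype=halo_dtype, count=n))

halos = np.concatenate(blocks)
print(halos["floats"][:, 5])  # e.g. the column that param() calls "mass"

This still needs roughly as much RAM as the file itself (about 25GB here), but it avoids the much larger per-dict and per-float object overhead of the original approach.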

Edit

It's very possible that you'll be able to run your code successfully with less physical RAM than would otherwise be needed, as your OS will likely page the data out to your hard disk and fetch it when needed, but this will massively increase run time.

Try running this code snippet and see what happens (forewarning: if you leave it running too long, your machine will become unresponsive and will most likely need a physical reset):

a = []
while 1:
    a = [a, a]

Expect your script to react similarly.
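On the other hand, if each record only needs to be seen once (for sums, histograms, or filtering), a generator keeps memory usage flat no matter how large the file is. A minimal sketch reusing the question's "4i12f" format string; iter_halos is a hypothetical helper name:

import struct

def iter_halos(path):
    # Yield one unpacked 64-byte ("4i12f") record at a time.
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return
            (n,) = struct.unpack("I", header)
            block = f.read(64 * n)
            for i in range(n):
                yield struct.unpack("4i12f", block[64 * i:64 * (i + 1)])

# Example: count records without materialising the full list.
print(sum(1 for _ in iter_halos("fullFoF_merger.cbin.z0.Run1")))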
