How to open parquet (binary data type) files in python without getting RAM error?

I converted some CSV data to parquet and was able to reduce the storage volume from 2.5 GB to 450 MB. I use the following code to open the parquet file:

df = pd.read_parquet("PATH/file9.parquet", engine='auto')

My problem is that I get the following error when I try to open the parquet file:

pyarrow.lib.ArrowIOError: Arrow error: Out of memory: malloc of size 2941974336 failed

I know that it's possible to open big CSV files by chunking them as follows:

for chunk in pd.read_csv("PATH/file9.csv", chunksize=chunksize):

It was possible to open smaller parquet files with that line, but I couldn't find any solution for opening big parquet files. Can anyone recommend another data type that is as compact as parquet and can be opened without problems? Or how can I chunk the parquet file?

The underlying read call does not support any sort of chunking parameter, so unfortunately no, you can't read a Parquet file in a piecewise way, not with that library anyway.

If you don't need all of the columns, though, you can pass in the columns=(...) kwarg.
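For example, a minimal sketch of column pruning with pandas; the column names col_a and col_b are placeholders, substitute whichever columns your file actually contains:

import pandas as pd

# Only the listed columns are read; the reader skips the other column
# chunks in the file, which keeps peak memory usage much lower.
df = pd.read_parquet("PATH/file9.parquet", engine='auto', columns=["col_a", "col_b"])
print(df.info(memory_usage="deep"))

If even a single column is too large to fit in RAM this won't be enough, but it often is when you only need a few of many columns.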
