
How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet function uses as the engine for parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to repeat running the code since this will cause another freeze - I don't know the verbatim error message).

Is there a good way to somehow write some part of the parquet file to memory without this occurring? I know that parquet files are columnar and it may not be possible to store only a part of the records to memory, but I'd like to potentially split it up if there is a workaround, or perhaps see if I am doing anything wrong while trying to read this in.

I do have a relatively weak computer in terms of specs, with only 6 GB memory and an i3. The CPU is 2.2 GHz with Turbo Boost available.

Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
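A minimal sketch of that (the file path and column names are placeholders):

```python
import pandas as pd

# Load only the columns you actually need; "col_a" and "col_b" are placeholder
# names. Parquet is columnar, so the engine can skip the other columns entirely
# instead of reading the whole file into memory.
df = pd.read_parquet("data.parquet", columns=["col_a", "col_b"])
```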

A second possibility is to use an online machine (like google colab) to load the parquet file and then save it as hdf. Once you have it, you can use it in chunks.
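A rough sketch of that workflow, with placeholder file names and key, assuming PyTables is installed for the HDF step:

```python
import pandas as pd

# On the online machine (e.g. Colab, which has more RAM): read the parquet
# file and re-save it as HDF5 in "table" format, which supports chunked reads.
df = pd.read_parquet("data.parquet")
df.to_hdf("data.h5", key="data", format="table")

# Back on the low-memory machine: iterate over the HDF file chunk by chunk.
for chunk in pd.read_hdf("data.h5", key="data", chunksize=1_000_000):
    ...  # process each chunk here
```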

You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and is meant for larger data.

https://examples.dask.org/dataframes/01-data-access.html
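A short sketch of the Dask approach (file path and column names are placeholders):

```python
import dask.dataframe as dd

# Dask reads the parquet file lazily, partition by partition, so the whole
# 2 GB / 30-million-row dataset never has to sit in RAM at once.
ddf = dd.read_parquet("data.parquet")

# The API mirrors pandas; nothing is loaded until .compute() is called.
# "some_column" and "some_value" are placeholder column names.
result = ddf.groupby("some_column")["some_value"].mean().compute()
print(result)
```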
