Saving large Pandas df of text data to disk crashes Colab due to using up all RAM. Is there a workaround?
I have a very large Pandas dataframe that I would like to save to disk to use later. The dataframe only contains string data.
However, no matter what format I use, the saving process crashes my Google Colab environment by using up all available RAM. The only exception is CSV, which doesn't complete even after 5 hours, but that also crashes the environment.
Is there a workaround for saving a large text Pandas dataframe to disk?
I have tried to_json, to_feather, to_parquet, and to_pickle, and they all crash the environment.
I also tried to_sql by using

from sqlalchemy import create_engine
engine = create_engine("sqlite:///database.db")
df.to_sql("table", engine)
I would like to save my dataframe to disk within a reasonable time without crashing the environment.
Use the chunksize argument with an appropriate number, e.g.

df.to_csv('filename.csv', chunksize=100000)

This tells pandas to write the data to CSV 100000 rows at a time, rather than essentially storing an entire second copy of your dataframe in RAM before dumping it to disk.
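The same chunked pattern also works in reverse when you load the file again later. A minimal sketch (the small dataframe and the chunk size of 100 are illustrative; pandas' read_csv accepts a chunksize argument and then yields the file piece by piece):

```python
import pandas as pd

# Illustrative string-only dataframe, written to CSV in chunks
pd.DataFrame({"text": [f"row {i}" for i in range(250)]}).to_csv(
    "filename.csv", index=False, chunksize=100
)

# Reading it back in chunks keeps memory usage flat on the way in, too:
# each iteration holds only one chunk of up to 100 rows
total = 0
for chunk in pd.read_csv("filename.csv", chunksize=100):
    total += len(chunk)
print(total)  # 250
```

Each chunk is an ordinary dataframe, so you can process and discard it before the next one is read.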
Similarly for to_sql: pandas will write in batches rather than everything at once.
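The batched to_sql write can be sketched as follows (the dataframe, the table name table_name, and the chunk size of 100 are illustrative; to_sql takes a chunksize argument just like to_csv):

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative string-only dataframe
df = pd.DataFrame({"text": [f"row {i}" for i in range(1000)]})

engine = create_engine("sqlite:///database.db")

# chunksize makes pandas insert 100 rows per batch instead of
# building the parameters for the whole table in memory at once
df.to_sql("table_name", engine, if_exists="replace", index=False, chunksize=100)
```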
Instead of using the Pandas method to_csv(), use a Dask dataframe to write the CSV file; it will be quicker than the Pandas method. Dask's write function will break your file into multiple chunks and store them. Code:
#Reading file
import pandas as pd
import dask.dataframe as dd
from sklearn.datasets import load_boston

df = dd.from_pandas(pd.DataFrame(load_boston().data), npartitions=10)

def operation(df):
    df['new'] = df[0]
    return df[['new']]

#Writing the file (one CSV per partition, matching the * pattern)
df.pipe(operation).to_csv('boston*.csv')
NOTE: Install the Dask package before use:
Using Conda:
conda install -c conda-forge dask
Using pip:
pip install "dask[complete]"    # Install everything
References:
[1] https://docs.dask.org/en/latest/install.html
[2] https://gist.github.com/hussainsultan/f7c2fb9f11008123bda405c5b024a79f