

Saving large Pandas df of text data to disk crashes Colab due to using up all RAM. Is there a workaround?

I have a very large Pandas dataframe that I would like to save to disk to use later. The dataframe only contains string data.

However, no matter what format I use, the saving process crashes my Google Colab environment due to using up all available RAM, except for CSV, which doesn't complete even after 5 hours.

Is there a workaround for saving a large text-only pandas dataframe to disk?

I have tried to_json, to_feather, to_parquet, and to_pickle, and they all crash the environment.

I also tried to_sql by using

from sqlalchemy import create_engine

# Create a SQLite engine and write the dataframe into a table
engine = create_engine("sqlite:///database.db")
df.to_sql("table", engine)

but that also crashes the environment.

I would like to save my dataframe to disk within a reasonable time without crashing the environment.

Use the chunksize argument with an appropriate number, e.g.

df.to_csv('filename.csv', chunksize=100000)

This tells pandas to write the data to the .csv file 100,000 rows at a time, rather than essentially building an entire second copy of your dataframe in RAM before dumping it to disk.
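If a single to_csv call still exhausts memory, the same idea can be applied by hand by appending slices of the dataframe to the file. A minimal sketch, assuming the dataframe is called df and with an arbitrarily chosen slice size:

# Append the dataframe to disk one slice at a time, so only one slice is
# serialized in memory at once; the header is written only for the first slice.
chunk_rows = 100000  # arbitrary; tune to the available RAM
for start in range(0, len(df), chunk_rows):
    df.iloc[start:start + chunk_rows].to_csv(
        'filename.csv',
        mode='w' if start == 0 else 'a',
        header=(start == 0),
        index=False,
    )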

The same applies to .to_sql: with chunksize, pandas writes in batches rather than everything at once.
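As a sketch of that, reusing the database and table names from the question (the batch size is again an arbitrary choice):

from sqlalchemy import create_engine

# Write the dataframe to SQLite in batches of 100000 rows instead of
# building the whole insert payload in memory at once.
engine = create_engine("sqlite:///database.db")
df.to_sql("table", engine, if_exists="replace", chunksize=100000)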

Instead of using the Pandas method to_csv(), use a Dask DataFrame to write the CSV file; it will be quicker than the pandas method. Dask's write function will break your file into multiple chunks and store them. Code:

# Reading: build a Dask dataframe (the Boston housing data is just a small stand-in example)
import dask.dataframe as dd
import pandas as pd
from sklearn.datasets import load_boston  # requires an older scikit-learn; load_boston was removed in 1.2

df = dd.from_pandas(pd.DataFrame(load_boston().data), npartitions=10)

def operation(df):
    # Example transformation: copy the first column into a new column
    df['new'] = df[0]
    return df[['new']]

# Writing: each partition is written to its own file (boston0.csv, boston1.csv, ...)
df.pipe(operation).to_csv('boston*.csv')
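Applied to the large string dataframe from the question, a minimal sketch might look like this (npartitions is an assumption; pick it so each partition fits comfortably in RAM):

import dask.dataframe as dd

# Split the in-memory pandas dataframe into partitions; each partition is
# serialized and written as its own file, so no full second copy is built in RAM.
ddf = dd.from_pandas(df, npartitions=20)    # npartitions is a guess
ddf.to_csv('dataframe-*.csv', index=False)  # writes one CSV file per partition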

NOTE: Install the Dask package before use:

Using Conda:

conda install -c conda-forge dask

Using pip:

pip install "dask[complete]"    # Install everything

References:

[1] https://docs.dask.org/en/latest/install.html

[2] https://gist.github.com/hussainsultan/f7c2fb9f11008123bda405c5b024a79f

