简体   繁体   中英

Store a Dask DataFrame as a pickle

I have a Dask DataFrame constructed as follows:

import dask.dataframe as dd

df = dd.read_csv('matrix.txt', header=None)
type(df) //dask.dataframe.core.DataFrame

Is there way to save this DataFrame as a pickle?

For example,

df.to_pickle('matrix.pkl')

From a quick inspection of the methods available in dask that is not directly possible. It's still possible to do as the other answer, but I fear that due to the eventual distributed nature of a dask dataframe it might be not straightforward.

Anyhow, if I were you, I'd go through another solution and use parquet as a storage. It offers you basically the same advantages of pickle, and more.

df.to_parquet('my_file.parquet')

Although, if your plan is to use pickle as a 'suspend' method to later on resume computation, saving to parquet would not really help.

My advise would be by far to use parquet. Look at this post where different technologies to store a general pandas dataframe are compared. You'll see that they don't even discuss pickle (which has some issues like it might be incompatible between two python versions). The article is slightly old, and now pandas/dask can directly work with parquet without the need of explicitly using pyarrow .

I guess that you're interested in reading time. There is always a tradeoff between file size and read time. Although in the article is shown that when you factor in multiple core operation you can get similar read performance with compressed parquet file (Parquet-snappy column)

在此处输入图片说明

Thus, I'll repeat myself. Go for parquet file and you'll future-proof yourself. Unless that your use case is very different than columnar/dataframe oriented one.

You can try pickling it like you would do with any other object - import pickle

with open('filename.pickle', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('filename.pickle', 'rb') as handle:
    b = pickle.load(handle)
print(a == b)

Further, please check this on safety of pickling dask dataframes and in what situations in might break

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM