
Pickle dump Pandas DataFrame

This is a question from a lazy man.

I have a pandas DataFrame with 4 million rows and would like to save it to disk in smaller chunks of pickle files.

Why smaller chunks? To save and load them more quickly.

My questions are:

1) Is there a better way (a built-in function) to save the data in smaller pieces than manually chunking it with np.array_split (roughly the sketch below)?

2) Is there any graceful way of gluing the chunks back together when I read them, other than manually concatenating them?
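For reference, what I mean by the manual approach is roughly this (the chunk count and file names are arbitrary):

import glob
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4_000_000, 5))

# Split row positions into 10 chunks and pickle each slice separately.
for i, pos in enumerate(np.array_split(np.arange(len(df)), 10)):
    df.iloc[pos].to_pickle(f'chunk_{i:02d}.pkl')

# Glue the chunks back together on load.
df2 = pd.concat(pd.read_pickle(p) for p in sorted(glob.glob('chunk_*.pkl')))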

Feel free to suggest any other data format suited for this job besides pickle.

If the goal is to save and load quickly you should look into using sql rather than raw text pickling. If your computer chokes when you ask it to write 4 million rows you can specify a chunk size.
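A minimal sketch of that idea, assuming SQLite from the standard library (the file and table names here are placeholders):

import sqlite3
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4_000_000, 5), columns=list('abcde'))

con = sqlite3.connect('frame.db')

# chunksize batches the INSERTs so the write happens in pieces
# instead of one giant statement.
df.to_sql('frame', con, if_exists='replace', index=False, chunksize=100_000)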

From there you can query slices with std. SQL.
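Continuing the same sketch, a slice comes back with an ordinary query (LIMIT/OFFSET is just one way to express a row range):

import sqlite3
import pandas as pd

con = sqlite3.connect('frame.db')

# Pull back only rows 200,000-299,999 of the stored table.
chunk = pd.read_sql('SELECT * FROM frame LIMIT 100000 OFFSET 200000', con)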

I've been using this for a DataFrame of size 7,000,000 x 250.

Use HDF5 (see the pandas to_hdf documentation).

import numpy as np
import pandas as pd

# A small example DataFrame.
df = pd.DataFrame(np.random.rand(5, 5))
df


# Write with blosc compression; append=False overwrites any existing key.
df.to_hdf('myrandomstore.h5', key='this_df', append=False, complib='blosc', complevel=9)

# Read it back.
new_df = pd.read_hdf('myrandomstore.h5', 'this_df')
new_df
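One note relevant to the chunked-loading question: the snippet above uses the default fixed format, so the file has to be read back whole. Writing with format='table' (an extra option, not part of the original snippet) lets read_hdf return just a row range:

# Table format supports partial reads, at some cost in write speed.
df.to_hdf('myrandomstore.h5', key='this_df', format='table', complib='blosc', complevel=9)

# start/stop slice by row position; where= allows simple queries on table format.
subset = pd.read_hdf('myrandomstore.h5', 'this_df', start=1, stop=3)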

