
Which is faster to load: pickle or HDF5 in Python?

Given a 1.5 GB list of pandas DataFrames, which format is fastest for loading compressed data: pickle (via cPickle), HDF5, or something else in Python?

  • I only care about the fastest speed to load the data into memory.
  • I don't care about dumping the data; it's slow, but I only do it once.
  • I don't care about file size on disk.

I would consider only two storage formats: HDF5 (PyTables) and Feather
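
For reference, here is a minimal sketch of writing and reading both formats (this assumes pandas with PyTables and pyarrow installed; the DataFrame contents and file names are purely illustrative):

    import pandas as pd

    # illustrative DataFrame; substitute your real data
    df = pd.DataFrame({
        "value": range(1_000_000),
        "ts": pd.Timestamp("2020-01-01") + pd.to_timedelta(range(1_000_000), unit="s"),
    })

    # HDF5 via PyTables: format="fixed" is the fastest variant to read back
    df.to_hdf("data.h5", key="df", mode="w", format="fixed")
    df_h5 = pd.read_hdf("data.h5", key="df")

    # Feather (requires pyarrow): a columnar format designed for fast reads
    df.to_feather("data.feather")
    df_fe = pd.read_feather("data.feather")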

Here are the results of my read/write comparison for the DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison for the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (various compression settings):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

Your results might differ, though, because all of my data was of the datetime dtype, so it's always better to run such a comparison with your real data, or at least with similar data.
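
If you want to reproduce a comparison like the one above on your own data, here is a rough timing sketch (the file names and the timing helper are just illustrative; df is assumed to be your real DataFrame):

    import time
    import pandas as pd

    def timed(label, func):
        # run func once and report the elapsed wall-clock time in seconds
        start = time.perf_counter()
        result = func()
        print(f"{label}: {time.perf_counter() - start:.3f} s")
        return result

    # write each format once (write speed is not the concern here)
    df.to_csv("df.csv", index=False)
    df.to_pickle("df.pkl")
    df.to_hdf("df_fixed.h5", key="df", mode="w", format="fixed")
    df.to_hdf("df_zlib.h5", key="df", mode="w", format="table",
              complib="zlib", complevel=5)

    # compare read times into memory
    timed("CSV", lambda: pd.read_csv("df.csv"))
    timed("Pickle", lambda: pd.read_pickle("df.pkl"))
    timed("HDF_fixed", lambda: pd.read_hdf("df_fixed.h5", key="df"))
    timed("HDF_tab_zlib_c5", lambda: pd.read_hdf("df_zlib.h5", key="df"))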
