
Which is faster to load: pickle or HDF5 in Python?

Given a 1.5 GB list of pandas DataFrames, which format is fastest for loading compressed data: pickle (via cPickle), HDF5, or something else in Python?

  • I only care about the fastest speed to load the data into memory.
  • I don't care about dumping the data; it's slow, but I only do it once.
  • I don't care about file size on disk.

I would consider only two storage formats: HDF5 (PyTables) and Feather
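
For reference, here is a minimal sketch of writing and reading both formats (this assumes pandas with PyTables and pyarrow installed; the DataFrame contents and file names are purely illustrative):

    import pandas as pd

    # illustrative DataFrame; substitute your real data
    df = pd.DataFrame({
        "value": range(1_000_000),
        "ts": pd.Timestamp("2020-01-01") + pd.to_timedelta(range(1_000_000), unit="s"),
    })

    # HDF5 via PyTables: format="fixed" is the fastest variant to read back
    df.to_hdf("data.h5", key="df", mode="w", format="fixed")
    df_h5 = pd.read_hdf("data.h5", key="df")

    # Feather (requires pyarrow): a columnar format designed for fast reads
    df.to_feather("data.feather")
    df_fe = pd.read_feather("data.feather")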

Here are the results of my read/write comparison for the DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison for the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (various compression settings):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

Your results might differ, though, because all of my data was of the datetime dtype, so it's always better to run such a comparison with your real data, or at least with similar data.
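
If you want to reproduce a comparison like the one above on your own data, here is a rough timing sketch (the file names and the timing helper are just illustrative; df is assumed to be your real DataFrame):

    import time
    import pandas as pd

    def timed(label, func):
        # run func once and report the elapsed wall-clock time in seconds
        start = time.perf_counter()
        result = func()
        print(f"{label}: {time.perf_counter() - start:.3f} s")
        return result

    # write each format once (write speed is not the concern here)
    df.to_csv("df.csv", index=False)
    df.to_pickle("df.pkl")
    df.to_hdf("df_fixed.h5", key="df", mode="w", format="fixed")
    df.to_hdf("df_zlib.h5", key="df", mode="w", format="table",
              complib="zlib", complevel=5)

    # compare read times into memory
    timed("CSV", lambda: pd.read_csv("df.csv"))
    timed("Pickle", lambda: pd.read_pickle("df.pkl"))
    timed("HDF_fixed", lambda: pd.read_hdf("df_fixed.h5", key="df"))
    timed("HDF_tab_zlib_c5", lambda: pd.read_hdf("df_zlib.h5", key="df"))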
