
Why does Python pickle load and dump inflate the size of an object on disk?

I have a pickled object in a file named b1.pkl:

$ ls -l b*
-rw-r--r--  1 fireball  staff  64743950 Oct 11 15:32 b1.pkl

Then I run the following Python code to load the object and dump it to a new file:

import numpy as np
import cPickle as pkl

# open both files in binary mode; pickle data is bytes, not text
fin = open('b1.pkl', 'rb')
fout = open('b2.pkl', 'wb')

x = pkl.load(fin)
pkl.dump(x, fout)

fin.close()
fout.close()

The file this code creates is nearly three times as large:

$ ls -l b*
-rw-r--r--  1 fireball  staff   64743950 Oct 11 15:32 b1.pkl
-rw-r--r--  1 fireball  staff  191763914 Oct 11 15:47 b2.pkl

Can anyone explain why the new file is so much larger than the original one? It should contain exactly the same structure.

The original pickle may have been written with a different protocol. Try passing protocol=2 as a keyword argument to the second pickle.dump call and compare the sizes again. A binary pickle should be much smaller.
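To see how much the protocol alone changes the size, here is a minimal sketch that serializes one object with protocols 0, 1 and 2 and prints the byte counts. It assumes Python 3's standard `pickle` module rather than the `cPickle` used in the question, and the dict of float lists is hypothetical sample data standing in for the real object. Note that Python 3 already defaults to a binary protocol, so there the inflation only appears when protocol 0 is requested explicitly.

```python
import pickle
import random

# Hypothetical sample data: a dict of float lists (not the asker's real object).
random.seed(0)
data = {i: [random.random() for _ in range(25)] for i in range(500)}

# Serialize the same object once per protocol and record the size in bytes.
sizes = {proto: len(pickle.dumps(data, protocol=proto)) for proto in (0, 1, 2)}

for proto, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (proto, size))
```

Protocol 0 writes every number as text terminated by a newline, while the binary protocols store each float in 8 bytes plus a one-byte opcode, which is where most of the difference comes from.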

Most likely your original b1.pkl was written with one of the more compact binary protocols (1 or 2), so the file starts out smaller.

When you load with cPickle, it reads the file correctly whatever protocol wrote it, so you never have to say which one was used. But when you dump the object again with default arguments, cPickle uses protocol 0, an ASCII format that takes much more space. That default exists for portability and backward compatibility; you have to request a binary protocol explicitly.

import numpy as np
import cPickle

# random data
s = {}
for i in xrange(5000):
    s[i] = np.random.randn(5,5)

# pickle it out the first time with binary protocol
with open('data.pkl', 'wb') as f:
    cPickle.dump(s, f, 2)

# read it back in and pickle it out with default args
with open('data.pkl', 'rb') as f:
    with open('data2.pkl', 'wb') as o:
        s = cPickle.load(f)
        cPickle.dump(s, o)

$ ls -l
1174109 Oct 11 16:05 data.pkl
3243157 Oct 11 16:08 data2.pkl
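If you want to check which protocol an existing file was written with before re-dumping it, the leading bytes give a hint. The sketch below assumes Python 3's `pickle` module; `declared_protocol` is a hypothetical helper name, not part of the library. It relies on the fact that pickles written with protocol 2 or later start with the PROTO opcode (byte 0x80) followed by the protocol number, while protocols 0 and 1 have no such header.

```python
import pickle


def declared_protocol(data):
    """Return the protocol number declared in a pickle's header, or None.

    Protocol 2 and later begin with the PROTO opcode (0x80) and one byte
    holding the protocol number. Protocols 0 and 1 carry no header, so
    None is returned for them.
    """
    if len(data) >= 2 and data[0:1] == b"\x80":
        return ord(data[1:2])
    return None


print(declared_protocol(pickle.dumps([1, 2, 3], protocol=2)))
print(declared_protocol(pickle.dumps([1, 2, 3], protocol=0)))
```

For a file on disk, reading the first two bytes in binary mode and passing them to the helper is enough; the rest of the file does not need to be loaded.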

pkl.dump(x, fout, 2) should produce a file about the same size as the original. If you do not specify a protocol version, pickle falls back to the old protocol 0.
