If pickling was interrupted, will unpickling necessarily always fail? - Python

Suppose my attempt to write a pickle object out to disk is incomplete due to a crash. Will an attempt to unpickle the object always lead to an exception or is it possible that the fragment that was written out may be interpreted as valid pickle and the error go unnoticed?

Contra the other answers offered, I believe that we can make a strong argument about the recoverability of a pickle. That answer is: "Yes, an incomplete pickle always leads to an exception."

Why are we able to say this? Because the "pickle" format is in fact a small stack-based language. In a stack-based language you write code that pushes item after item onto a stack, then invoke an operator that does something with the data you've accumulated. And it just so happens that a pickle has to end with the command ".", which says: "take the item now on top of the stack and return it as the value of this pickle." If your pickle is chopped off early, it will not end with this command, and unpickling will raise an error such as EOFError.
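One way to check this claim for a concrete object is to try every proper prefix of its pickle. The sketch below assumes Python 3; the helper name prefixes_that_load and the sample object are made up for illustration. It catches Exception broadly because the exact error type (EOFError, pickle.UnpicklingError, and so on) depends on where the data was cut.

```python
import pickle


def prefixes_that_load(obj):
    """Return the lengths of all proper prefixes of pickle.dumps(obj)
    that unpickle without raising (hypothetical helper)."""
    data = pickle.dumps(obj)
    survivors = []
    for n in range(len(data)):  # n < len(data), so data[:n] is always truncated
        try:
            pickle.loads(data[:n])
        except Exception:
            # The error type varies with the cut position, so catch broadly.
            continue
        survivors.append(n)
    return survivors


sample = {"name": "example", "values": [1, 2, 3]}
print(prefixes_that_load(sample))
```

An empty list means that no truncated prefix of this particular pickle was accepted. This only covers a single pickle written by one dumps call; it says nothing about files that hold several pickles back to back.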

If you want to try recovering some of the data, you might have to write your own interpreter, or call into pickle.py in a way that gets around the EOFError it raises when it finishes interpreting the stack without finding a ".". The main thing to keep in mind is that, as in most stack-based languages, big data structures are built "backwards": first you put lots of little strings or numbers on the stack, then you invoke an operation that says "put those together into a list" or "grab pairs of items on the stack and make a dictionary". So, if a pickle is interrupted, you'll find the stack full of pieces of the object that was going to be built, but you'll be missing that final code that tells you what was going to be built from the pieces.
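To see the "pieces first, operator last" ordering without writing an interpreter, the standard pickletools module can list the opcodes in a pickle. The following is a rough sketch, assuming Python 3; readable_opcodes is a hypothetical helper, and protocol 2 is picked only because it has no framing, which keeps a truncated stream easy to read.

```python
import pickle
import pickletools


def readable_opcodes(data):
    """Return the names of the opcodes that can be decoded from data,
    stopping quietly where the stream runs out (hypothetical helper)."""
    names = []
    try:
        for opcode, arg, pos in pickletools.genops(data):
            names.append(opcode.name)
    except ValueError:
        # genops raises ValueError when the data ends before a STOP opcode.
        pass
    return names


full = pickle.dumps([1, 2, 3], protocol=2)
cut = full[:-2]  # drop the last two bytes: the APPENDS and STOP opcodes

print(readable_opcodes(full))
print(readable_opcodes(cut))
```

In the truncated version the three integers are still listed, but the APPENDS opcode that would have put them into the list, and the final STOP, are gone.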

Pickling an object returns a str object (bytes in Python 3), or writes that data to a file; it doesn't modify the original object. If a "crash" (exception) happens inside a pickling call, the result won't be returned to the caller, so you don't have anything that you could try to unpickle. Besides, why would you want to unpickle some dud rubbish left over after an exception?

This is a development of S. Lott's answer, with my suggestion: append a hash or checksum to your data, and verify it before unpickling.

Here is a (simple) implementation of safepickle/safeunpickle to show how you can append a hash to the pickled data (a cryptographic hash in this case):

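import hashlib
import pickle  # Python 3; the original used cPickle under Python 2

# Length in bytes of a SHA-1 digest.
_HASHLEN = 20

def safepickle(obj):
    # Pickle the object and append the SHA-1 digest of the pickled bytes.
    s = pickle.dumps(obj)
    s += hashlib.sha1(s).digest()
    return s

def safeunpickle(pstr):
    # Split off the trailing digest and verify it before unpickling.
    data, checksum = pstr[:-_HASHLEN], pstr[-_HASHLEN:]
    if hashlib.sha1(data).digest() != checksum:
        raise ValueError("Pickle hash does not match!")
    return pickle.loads(data)


l = list(range(20))
p = safepickle(l)
new_l = safeunpickle(p)
print(new_l == l)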

This method is to ensure that what you unpickle matches what you pickled and wrote to disk previously, but it does not protect against mixing up different pickles or malicious attacks, of course.

(This method can be generalized to the pattern safe_write_file and safe_read_file for any whole-file data.)
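A minimal sketch of that generalization might look like the following. The names safe_write_file and safe_read_file come from the sentence above, but the signatures, the length check and the temporary file path are assumptions made for this example.

```python
import hashlib
import os
import tempfile

_HASHLEN = hashlib.sha1().digest_size  # 20 bytes for SHA-1


def safe_write_file(path, data):
    """Write data (bytes) followed by its SHA-1 digest."""
    with open(path, "wb") as f:
        f.write(data)
        f.write(hashlib.sha1(data).digest())


def safe_read_file(path):
    """Read the file back and verify the trailing digest."""
    with open(path, "rb") as f:
        blob = f.read()
    if len(blob) < _HASHLEN:
        raise ValueError("File too short to contain a checksum")
    data, checksum = blob[:-_HASHLEN], blob[-_HASHLEN:]
    if hashlib.sha1(data).digest() != checksum:
        raise ValueError("File checksum does not match!")
    return data


# Usage: round-trip some bytes through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "payload.bin")
safe_write_file(path, b"any whole-file data")
print(safe_read_file(path) == b"any whole-file data")
```

If the write is cut short, the last 20 bytes of the file are no longer the digest of everything before them, so safe_read_file raises ValueError instead of returning partial data.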

I doubt you could make a claim that it will always lead to an exception. Pickles are actually programs written in a specialized stack language. The internal details of pickles change from version to version, and new pickle protocols are added occasionally. The state of the pickle after a crash, and the resulting effects on the unpickler, would be very difficult to summarize in a simple statement like "it will always lead to an exception".

To be sure that you have a "complete" pickle file, you need to pickle three things.

  1. Pickle a header of some kind that claims how many objects and what the end-of-file flag will look like. A tuple of an integer and the EOF string, for example.

  2. Pickle the objects you actually care about. The count is given by the header.

  3. Pickle a tail object that you don't actually care about, but which simply matches the claim made in the header. This can be simply a string that matches what was in the header.

When you unpickle this file, you have to unpickle three things:

  1. The header. You care about the count and the form of the tail.

  2. The objects you actually care about.

  3. The tail object. Check that it matches the header. Other than that, it doesn't convey much except that the file was written in its entirety.
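The header/objects/tail scheme above could be sketched roughly as follows, assuming Python 3. The function names write_framed and read_framed and the marker string are invented for this example; only the three-part layout comes from the description.

```python
import os
import pickle
import tempfile

EOF_MARKER = "END-OF-PICKLES"  # arbitrary marker chosen for this sketch


def write_framed(path, objects):
    """Pickle a (count, marker) header, then each object, then the marker."""
    objects = list(objects)
    with open(path, "wb") as f:
        pickle.dump((len(objects), EOF_MARKER), f)  # 1. header
        for obj in objects:                         # 2. the real objects
            pickle.dump(obj, f)
        pickle.dump(EOF_MARKER, f)                  # 3. tail


def read_framed(path):
    """Unpickle header, objects and tail; raise if the file is incomplete."""
    with open(path, "rb") as f:
        count, marker = pickle.load(f)
        objects = [pickle.load(f) for _ in range(count)]
        tail = pickle.load(f)
    if tail != marker:
        raise ValueError("Tail does not match header; file is incomplete")
    return objects


path = os.path.join(tempfile.mkdtemp(), "objects.pkl")
write_framed(path, [{"a": 1}, [1, 2, 3], "three"])
print(read_framed(path))
```

A file that was cut off partway will normally fail earlier, inside one of the pickle.load calls; the tail comparison is the last check that everything promised by the header was actually written.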
