简体   繁体   中英

Lightweight pickle for basic types in python?

All I want to do is serialize and unserialize tuples of strings or ints.

I looked at pickle.dumps() but the byte overhead is significant. Basically it looks like it takes up about 4x as much space as it needs to. Besides, all I need is basic types and have no need to serialize objects.

marshal is a little better in terms of space but the result is full of nasty \\x00 bytes. Ideally I would like the result to be human readable.

I thought of just using repr() and eval(), but is there a simple way I could accomplish this without using eval()?

This is getting stored in a db, not a file. Byte overhead matters because it could make the difference between requiring a TEXT column versus a varchar, and generally data compactness affects all areas of db performance.

Take a look at json , at least the generated dumps are readable with many other languages.

JSON (JavaScript Object Notation) http://json.org is a subset of JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data interchange format.

personally i would use yaml . it's on par with json for encoding size, but it can represent some more complex things (eg classes, recursive structures) when necessary.

In [1]: import yaml
In [2]: x = [1, 2, 3, 'pants']
In [3]: print(yaml.dump(x))
[1, 2, 3, pants]

In [4]: y = yaml.load('[1, 2, 3, pants]')
In [5]: y
Out[5]: [1, 2, 3, 'pants']

Maybe you're not using the right protocol:

>>> import pickle
>>> a = range(1, 100)
>>> len(pickle.dumps(a))
492
>>> len(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))
206

See the documentation for pickle data formats .

If you need a space efficient solution you can use Google Protocol buffers.

Protocol buffers - Encoding

Protocol buffers - Python Tutorial

There are some persistence builtins mentioned in the python documentation but I don't think any of these is remarkable smaller in the produced filesize.

You could alway use the configparser but there you only get string, int, float, bool.

"the byte overhead is significant"

Why does this matter? It does the job. If you're running low on disk space, I'd be glad to sell you a 1Tb for $500.

Have you run it? Is performance a problem? Can you demonstrate that the performance of serialization is the problem?

"I thought of just using repr() and eval(), but is there a simple way I could accomplish this without using eval()?"

Nothing simpler than repr and eval.

What's wrong with eval?

Is is the "someone could insert malicious code into the file where I serialized my lists" issue?

Who -- specifically -- is going to find and edit this file to put in malicious code? Anything you do to secure this (ie, encryption) removes "simple" from it.

Luckily there is solution which uses COMPRESSION, and solves the general problem involving any arbitrary Python object including new classes. Rather than micro-manage mere tuples sometimes it's better to use a DRY tool.
Your code will be more crisp and readily refactored in similar future situations.

y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

[If you are still concerned, why not stick those tuples in a dictionary, then apply y_serial to the dictionary. Probably any overhead will vanish due to the transparent compression in the background by zlib.]

As to readability, the documentation also gives details on why cPickle was selected over json.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM