Fastest way to store large files in Python

I recently asked a question regarding how to save large python objects to file. I had previously run into problems converting massive Python dictionaries into strings and writing them to file via write(). Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience in the field of such large files. I wanted to know if it would be faster, or even possible, to zip this pickle file prior to storing it to memory.

You can compress the data with bzip2:

from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib

hugeData = {'key': {'x': 1, 'y':2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
  json.dump(hugeData, f)

Load it like this:

from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib

with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
  hugeData = json.load(f)
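
On Python 3, a minimal equivalent sketch (same placeholder filename) can skip contextlib entirely, because bz2.open can give you a text-mode stream that json writes to directly:

import bz2, json

hugeData = {'key': {'x': 1, 'y': 2}}

# 'wt'/'rt' open the compressed file in text mode, so json's str output
# is encoded and compressed on the fly.
with bz2.open('data.json.bz2', 'wt') as f:
    json.dump(hugeData, f)

with bz2.open('data.json.bz2', 'rt') as f:
    hugeData = json.load(f)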

You can also compress the data using zlib or gzip with pretty much the same interface. However, both zlib's and gzip's compression ratios will be lower than the one achieved with bzip2 (or lzma).
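
For instance, swapping gzip into the snippet above is mostly a matter of changing the file class (a sketch with a placeholder filename):

import gzip, json, contextlib

hugeData = {'key': {'x': 1, 'y': 2}}

# gzip.GzipFile is used here as a drop-in replacement for bz2.BZ2File.
with contextlib.closing(gzip.GzipFile('data.json.gz', 'wb')) as f:
    json.dump(hugeData, f)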

Pure Python code is extremely slow when it comes to data serialization. If you try to create an equivalent of Pickle in pure Python, you'll see that it is super slow. Fortunately, the built-in modules that do this are quite good.

Apart from cPickle, you will find the marshal module, which is a lot faster. But it needs a real file handle (not a file-like object). You can import marshal as Pickle and see the difference. I don't think you can make a custom serializer that is much faster than this...
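
A minimal sketch of what that looks like (placeholder filename; note that marshal's on-disk format is not guaranteed to be stable across Python versions, so it is only suitable for short-lived data):

import marshal

hugeData = {'key': {'x': 1, 'y': 2}}

# marshal wants a real binary file object opened on disk.
with open('data.marshal', 'wb') as f:
    marshal.dump(hugeData, f)

with open('data.marshal', 'rb') as f:
    hugeData = marshal.load(f)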

Here's an actual (not so old) serious benchmark of Python serializers.

I'd just expand on phihag's answer.

When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object in order to serialize. That's true even when streaming it to BZ2File. In my case I was even running out of swap space.

But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.

For tuples composed of strings (requires strings to contain no commas):

import ujson as json
from bz2 import BZ2File

bigdata = { ('a','b','c') : 25, ('d','e') : 13 }
# Flatten each tuple key into a comma-separated string so json can store it.
# (viewitems() is Python 2; on Python 3 use items() instead.)
bigdata = dict([(','.join(k), v) for k, v in bigdata.viewitems()])

f = BZ2File('filename.json.bz2',mode='wb')
json.dump(bigdata,f)
f.close()

To re-compose the tuples:

bigdata = dict([(tuple(k.split(',')),v) for k,v in bigdata.viewitems()])

Alternatively, if e.g. your keys are 2-tuples of integers:

bigdata2 = { (1,2): 1.2, (2,3): 3.4}
bigdata2 = dict([('%d,%d' % k, v) for k, v in bigdata2.viewitems()])
# ... save, load ...
bigdata2 = dict([(tuple(map(int,k.split(','))),v) for k,v in bigdata2.viewitems()])

Another advantage of this approach over pickle is that json appears to compress significantly better than pickle when using bzip2 compression.
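
A rough way to check that claim on a small sample of your own data is to compare the bzip2-compressed sizes of both serializations in memory (just a sketch with made-up sample data; the actual numbers depend entirely on your data):

import bz2, json, pickle

sample = {'a,b,c': 25, 'd,e': 13}

json_size = len(bz2.compress(json.dumps(sample).encode('utf-8')))
pickle_size = len(bz2.compress(pickle.dumps(sample)))
print(json_size, pickle_size)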

faster, or even possible, to zip this pickle file prior to [writing]

Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)

See http://docs.python.org/library/gzip.html. Basically, you create a special kind of stream with

gzip.GzipFile("output file name", "wb")

and then use it exactly like an ordinary file created with open(...) (or file(...) for that matter).
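
So, for the pickle case from the question, a minimal sketch would be (placeholder filename):

import gzip, pickle

hugeData = {'key': {'x': 1, 'y': 2}}

# The pickled bytes are compressed on the fly as they are written, so no
# uncompressed copy of the file is ever built first.
with gzip.GzipFile('data.pkl.gz', 'wb') as f:
    pickle.dump(hugeData, f)

with gzip.GzipFile('data.pkl.gz', 'rb') as f:
    hugeData = pickle.load(f)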

Look at Google's ProtoBuffers. Although they are not designed for large files out of the box, like audio-video files, they do well with object serialization as in your case, because they were designed for it. Practice shows that some day you may need to update the structure of your files, and ProtoBuffers will handle it. Also, they are highly optimized for compression and speed. And you're not tied to Python; Java and C++ are well supported.
