
How much memory does a Python Pickle take?

I have some pandas DataFrames which I can save to disk with .to_pickle() . Such an object is 200k-700k.

I see from memcache.py in the python-memcached GitHub project that it pickles objects and compresses them before caching.

By default, memcached only allows values up to 1MB. I find that trying to cache my 200k DataFrames works fine, but the 600k ones don't get set at the Python memcache level (the client doesn't even issue the set command unless I start memcached with -I and set memcache.SERVER_MAX_VALUE_LENGTH accordingly for my Python client).
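For reference, raising the limit takes two coordinated changes, one on the daemon and one on the client. A sketch of both, assuming a python-memcached version whose Client constructor accepts a server_max_value_length keyword (the 5 MB figure is just an example):

```python
# Server side: start memcached with a raised item-size limit, e.g.:
#   memcached -I 5m
# Client side: tell python-memcached about the matching limit; otherwise it
# silently skips set() calls for values over its default 1MB cap.
import memcache

mc_client = memcache.Client(
    ['localhost:11211'],
    debug=0,
    server_max_value_length=5 * 1024 * 1024,  # match the daemon's -I value
)
```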

Storing ~100 such dataframes to memcache with -I 5m lets them all fit, and the same data takes up 36MB (36212 bytes) on disk as written pickle files. Per the memcached stats command, I see nearly 3x that many bytes written:

STAT bytes_read 89917017
STAT bytes_written 89917211
...
STAT bytes 53022739

It is strange, then, that only 53MB are being stored if 89MB were written.

If I alter my memcaching code to pickle the DataFrames first (ie write to a tempfile with .to_pickle() , then read that tempfile back and store it to memcache), I see the data sizes per memcache stats matching what's on disk when I store the same files:

STAT bytes_read 36892901
STAT bytes_written 36893095
...
STAT bytes 36896667

What is the ratio of the memory used to store a pickled object to its size on disk? And why wouldn't python-memcache do a similarly efficient job of converting DataFrames to pickles as small as .to_pickle() produces?

It seems that python-memcache uses ASCII encoding (pickle protocol 0) when it pickles objects, while pandas' to_pickle() uses pickle protocol 2, whose binary encoding is smaller. Were I to export my dataframes to CSV per @BrenBarn's suggestion, I'd get files slightly larger than the binary-pickled dataframes, but still only about 1/3 the size of an ASCII-pickled dataframe.
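The size gap is easy to reproduce with the stdlib pickle module alone. A minimal sketch, using a plain list of ints as a stand-in for real DataFrame contents, comparing protocol 0 (python-memcached's default ASCII format) against binary protocol 2:

```python
import pickle

# Stand-in payload: a plain list of ints (the real payloads here are DataFrames)
payload = list(range(10000))

ascii_pickle = pickle.dumps(payload, protocol=0)   # protocol 0: text opcodes, one per value
binary_pickle = pickle.dumps(payload, protocol=2)  # protocol 2: compact binary opcodes

print(len(ascii_pickle), len(binary_pickle))  # protocol 0 is several times larger
```

Exact ratios depend on the data, but protocol 0 spells every value out as text, so it is consistently the larger of the two.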

My workaround is to use pandas to do the binary pickling before memcaching, like this (I also added a namespace arg, much like Google App Engine's, to help ensure key uniqueness when different applications share the same memcache):

import memcache
import tempfile
import pandas as pd
mc_client = memcache.Client(['localhost:11211'], debug=0)
def mc_get(key, namespace):
    """ Get a pickle from memcache and convert it back to a dataframe
    """
    data = mc_client.get('{}_{}'.format(namespace, key))
    if data is None:
        return
    # Write the pickled bytes to a tempfile so pandas can read them back
    temp_file = tempfile.NamedTemporaryFile()
    temp_file.write(data)
    temp_file.flush()
    return pd.read_pickle(temp_file.name)

def mc_set(key, df, namespace):
    """ Pickle the dataframe via a tempfile and store the bytes to memcache
    """
    temp_file = tempfile.NamedTemporaryFile()
    # to_pickle() writes binary pickle data to the tempfile's path
    df.to_pickle(temp_file.name)
    temp_file.seek(0)  # rewind our handle before reading what pandas wrote
    data = temp_file.read()
    mc_client.set('{}_{}'.format(namespace, key), data)

It may seem like this use of a tempfile would slow things down, since it writes to disk, but tests show it to be twice as fast as just storing pickles to disk and loading them from there.

Looking at the python-memcached code, I see that one can pass min_compress_len=X to set() to trigger python-memcache to compress values before setting them. Using this method reduced memory to 40% of what my pre-pickling trick alone used.
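python-memcached's compression is plain zlib, so the saving can be previewed directly with the stdlib. A sketch using repetitive stand-in data (real DataFrames with repeated values compress similarly, though the ratio varies):

```python
import pickle
import zlib

# Repetitive stand-in data; repeated values give zlib something to work with
payload = list(range(100)) * 100

raw = pickle.dumps(payload, protocol=2)
compressed = zlib.compress(raw)  # what python-memcached does once a value exceeds min_compress_len

print(len(raw), len(compressed))  # compressed is substantially smaller here
```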

Lastly, the python-memcached constructor takes a pickleProtocol arg which, if set to 2 , will use the same pickling protocol pandas' to_pickle() does.

Combining pickleProtocol=2 with min_compress_len=1 (to force compression of every value) cut memory usage to about 25% of what binary pickling alone used, and the compression overhead added about 13% to the runtime of writing all my dataframes to memcache.
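The whole chain (default ASCII pickling, then binary protocol 2, then protocol 2 plus zlib) can be sketched the same way; the exact ratios for real DataFrames will differ from this repetitive stand-in data:

```python
import pickle
import zlib

payload = list(range(256)) * 50  # repetitive stand-in data

p0 = pickle.dumps(payload, protocol=0)   # python-memcached's default
p2 = pickle.dumps(payload, protocol=2)   # with pickleProtocol=2
p2z = zlib.compress(p2)                  # plus compression via min_compress_len=1

print(len(p0), len(p2), len(p2z))  # each step shrinks the stored value
```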
