
Data size in memory vs. on disk

How does the RAM required to store data in memory compare to the disk space required to store the same data in a file? Or is there no generalized correlation?

For example, say I simply have a billion floating point values. Stored in binary form, that'd be 4 billion bytes or 3.7 GB on disk (not including headers and such). Then say I read those values into a list in Python... how much RAM should I expect that to require?
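A quick sanity check of that arithmetic (assuming 4-byte single-precision values):

>>> 10**9 * 4 / 2**30  # a billion 4-byte floats, in GiB
3.725290298461914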

Python Object Data Size

If the data is stored in a Python object, there will be a little more data attached to the actual data in memory.

This is easy to test.
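For a quick first look, sys.getsizeof reports the per-object footprint directly (the exact numbers below are from a 64-bit CPython 3 build and may vary slightly across versions):

>>> import sys
>>> sys.getsizeof(1.0)    # 8-byte double plus the object header
24
>>> sys.getsizeof([1.0])  # list header plus one 8-byte pointer
64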

[Plot: The size of data in various forms]

It is interesting to note that, at first, the overhead of the Python object is significant for small data, but it quickly becomes negligible.

Here is the IPython code used to generate the plot:

%matplotlib inline
import random
import sys
import array
import matplotlib.pyplot as plt

max_doubles = 10000

raw_size = []
array_size = []
string_size = []
list_size = []
set_size = []
tuple_size = []
size_range = range(1, max_doubles)  # start at 1 so the log-log axes have no zero point

# test double size
for n in size_range:
    double_array = array.array('d', [random.random() for _ in range(n)])
    double_string = double_array.tobytes()  # tostring() was removed in Python 3.9
    double_list = double_array.tolist()
    double_set = set(double_list)
    double_tuple = tuple(double_list)

    raw_size.append(double_array.buffer_info()[1] * double_array.itemsize)  # element count * item size = raw bytes
    array_size.append(sys.getsizeof(double_array))
    string_size.append(sys.getsizeof(double_string))
    list_size.append(sys.getsizeof(double_list))
    set_size.append(sys.getsizeof(double_set))
    tuple_size.append(sys.getsizeof(double_tuple))

# display
plt.figure(figsize=(10,8))
plt.title('The size of data in various forms', fontsize=20)
plt.xlabel('Data Size (double, 8 bytes)', fontsize=15)
plt.ylabel('Memory Size (bytes)', fontsize=15)
plt.loglog(
    size_range, raw_size, 
    size_range, array_size, 
    size_range, string_size,
    size_range, list_size,
    size_range, set_size,
    size_range, tuple_size
)
plt.legend(['Raw (Disk)', 'Array', 'String', 'List', 'Set', 'Tuple'], fontsize=15, loc='best')

In a plain Python list, every double-precision number requires at least 32 bytes of memory, but only 8 bytes are used to store the actual number; the rest is needed to support the dynamic nature of Python.

The float object used in CPython is defined in floatobject.h:

typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

where PyObject_HEAD is a macro that expands to the PyObject struct:

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Therefore, every floating point object in Python stores two pointer-sized fields (each taking 8 bytes on a 64-bit architecture) besides the 8-byte double, giving 24 bytes of heap-allocated memory per number. This is confirmed by sys.getsizeof(1.0) == 24.
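A minimal ctypes sketch that mirrors this layout arrives at the same 24 bytes (the struct definitions here are illustrative re-creations, not CPython's own headers):

import ctypes

class PyObjectHead(ctypes.Structure):           # mirrors PyObject
    _fields_ = [("ob_refcnt", ctypes.c_ssize_t),
                ("ob_type", ctypes.c_void_p)]

class PyFloatObjectMirror(ctypes.Structure):    # mirrors PyFloatObject
    _fields_ = [("ob_base", PyObjectHead),
                ("ob_fval", ctypes.c_double)]

print(ctypes.sizeof(PyFloatObjectMirror))       # 24 on a 64-bit build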

This means that a list of n doubles in Python takes at least 8*n bytes of memory just to store the pointers (PyObject*) to the number objects, and each number object requires an additional 24 bytes. To test it, try running the following lines in the Python REPL:

>>> import math
>>> list_of_doubles = [math.sin(x) for x in range(10*1000*1000)]

and see the memory usage of the Python interpreter (I got around 350 MB of allocated memory on my x86-64 computer). Note that if you tried:

>>> list_of_doubles = [1.0 for __ in range(10*1000*1000)]

you would obtain just about 80 MB, because all elements in the list refer to the same instance of the floating point number 1.0.
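The back-of-envelope arithmetic lines up with both observations (the gap is list over-allocation and allocator overhead):

>>> n = 10 * 1000 * 1000
>>> (8 * n + 24 * n) / 2**20  # pointers plus 10M distinct float objects, in MiB
305.17578125
>>> (8 * n) / 2**20           # pointers only, when every slot shares one float
76.2939453125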
