
Data size in memory vs. on disk

How does the RAM required to store data in memory compare to the disk space required to store the same data in a file? Or is there no generalized correlation?

For example, say I simply have a billion floating point values. Stored in binary form, that'd be 4 billion bytes or 3.7GB on disk (not including headers and such). Then say I read those values into a list in Python... how much RAM should I expect that to require?
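For a rough check of the disk-side arithmetic above (a minimal sketch; it assumes the 4-byte single-precision values the question describes and reports the size in GiB):

# Back-of-envelope check of the figure in the question:
# one billion 4-byte (single-precision) floats stored as raw binary.
n_values = 1_000_000_000
bytes_on_disk = n_values * 4      # 4,000,000,000 bytes
print(bytes_on_disk / 2**30)      # ~3.73 GiB, the "3.7GB" mentioned above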

Python Object Data Size

If the data is stored in Python objects, a little extra data is attached to the actual values in memory.

This may be easily tested.

[Plot: the size of data in various forms]

It is interesting to note that the overhead of the Python object is significant for small data, but quickly becomes negligible as the amount of data grows.

Here is the IPython code used to generate the plot:

%matplotlib inline
import random
import sys
import array
import matplotlib.pyplot as plt

max_doubles = 10000

raw_size = []
array_size = []
string_size = []
list_size = []
set_size = []
tuple_size = []
size_range = range(max_doubles)

# test double size
for n in size_range:
    double_array = array.array('d', [random.random() for _ in range(n)])
    double_string = double_array.tobytes()
    double_list = double_array.tolist()
    double_set = set(double_list)
    double_tuple = tuple(double_list)

    raw_size.append(double_array.buffer_info()[1] * double_array.itemsize)
    array_size.append(sys.getsizeof(double_array))
    string_size.append(sys.getsizeof(double_string))
    list_size.append(sys.getsizeof(double_list))
    set_size.append(sys.getsizeof(double_set))
    tuple_size.append(sys.getsizeof(double_tuple))

# display
plt.figure(figsize=(10,8))
plt.title('The size of data in various forms', fontsize=20)
plt.xlabel('Data Size (double, 8 bytes)', fontsize=15)
plt.ylabel('Memory Size (bytes)', fontsize=15)
plt.loglog(
    size_range, raw_size, 
    size_range, array_size, 
    size_range, string_size,
    size_range, list_size,
    size_range, set_size,
    size_range, tuple_size
)
plt.legend(['Raw (Disk)', 'Array', 'String', 'List', 'Set', 'Tuple'], fontsize=15, loc='best')

In a plain Python list, every double-precision number requires at least 32 bytes of memory: only 8 bytes are used to store the actual number, and the rest is needed to support the dynamic nature of Python.

The float object used in CPython is defined in floatobject.h:

typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

where PyObject_HEAD is a macro that expands to the PyObject struct:

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Therefore, every floating point object in Python stores two pointer-sized fields (so each takes 8 bytes on a 64-bit architecture) besides the 8-byte double, giving 24 bytes of heap-allocated memory per number. This is confirmed by sys.getsizeof(1.0) == 24.
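A quick check in the REPL (a minimal sketch; the exact list header size varies between CPython versions, but the 24-byte float object and the 8-byte per-element pointer hold on any 64-bit build):

import sys

f = 1.0
print(sys.getsizeof(f))    # 24: 8 (refcount) + 8 (type pointer) + 8 (the double itself)

# The list only grows by one 8-byte pointer per element; the float
# objects it points to are not counted by getsizeof.
print(sys.getsizeof([f]) - sys.getsizeof([]))    # 8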

This means that a list of n doubles in Python takes at least 8*n bytes of memory just to store the pointers (PyObject*) to the number objects, and each number object requires an additional 24 bytes. To test this, try running the following lines in the Python REPL:

>>> import math
>>> list_of_doubles = [math.sin(x) for x in range(10*1000*1000)]

and see the memory usage of the Python interpreter (I got around 350 MB of allocated memory on my x86-64 computer). Note that if you tried:

>>> list_of_doubles = [1.0 for __ in range(10*1000*1000)]

you would obtain only about 80 MB, because all elements in the list refer to the same instance of the floating-point number 1.0.
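One way to see both effects from inside the interpreter is to add up the size of the list itself plus each distinct float object it points to (a sketch using a hypothetical helper, estimate_bytes; it ignores allocator overhead and uses one million elements to keep it quick):

import math
import sys

def estimate_bytes(values):
    # list header + one pointer per slot, plus each distinct object counted once
    distinct_objects = {id(v): v for v in values}.values()
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in distinct_objects)

distinct = [math.sin(x) for x in range(1000000)]   # one million distinct float objects
shared = [1.0 for _ in range(1000000)]             # one million references to a single float

print(estimate_bytes(distinct))   # ~8 MB of pointers + ~24 MB of float objects
print(estimate_bytes(shared))     # ~8 MB of pointers + a single 24-byte float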
