
Difference between list and NumPy array memory size

I've heard that NumPy arrays are more efficient than Python's built-in lists and that they take up less space in memory. As I understand it, NumPy stores the values next to each other in memory, while the Python list implementation stores 8-byte pointers to the values. However, when I try to test this in a Jupyter notebook, it turns out that both objects have the same size.

import numpy as np
from sys import getsizeof
array = np.array([_ for _ in range(4)])
getsizeof(array), array

Returns (128, array([0, 1, 2, 3])). The same as:

l = list([_ for _ in range(4)])
getsizeof(l), l

Gives (128, [0, 1, 2, 3])

Can you provide a clear example that shows this difference in a Jupyter notebook?

getsizeof is not a good measure of memory use, especially with lists. As you note, the list has a buffer of pointers to objects stored elsewhere in memory. getsizeof reports the size of that buffer, but tells us nothing about the objects it points to.
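For instance, here is a minimal sketch (not from the original answer; the exact byte counts vary with the Python version) that counts the element objects as well as the list's own buffer:

import sys

l = list(range(4))

# The list object itself: header plus the buffer of 4 pointers
print(sys.getsizeof(l))

# Add the int objects the pointers refer to, for a fuller picture
print(sys.getsizeof(l) + sum(sys.getsizeof(x) for x in l))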

With

In [66]: list(range(4))
Out[66]: [0, 1, 2, 3]

the list has its basic object storage, plus the buffer with 4 pointers (plus some growth room). The numbers themselves are stored elsewhere. In this case the numbers are small, and already created and cached by the interpreter, so their storage doesn't add anything. But larger numbers (and floats) are created with each use and take up space. Also, a list can contain anything: pointers to other lists, strings, dicts, or whatever.
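To see that caching effect, here is a small sketch (the behaviour and sizes are CPython-specific and only illustrative):

import sys

small = [7] * 4                          # all four slots point at the same cached int object
big = [10**20 + i for i in range(4)]     # four distinct, larger int objects

print(small[0] is small[1])              # True: CPython caches small ints
print(sys.getsizeof(small[0]))           # bytes for one small int object
print(sys.getsizeof(big[0]))             # a larger int costs more bytes per element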

In [67]: arr = np.array([i for i in range(4)])   # via a list
In [68]: arr
Out[68]: array([0, 1, 2, 3])
In [69]: np.array(range(4))            # more direct
Out[69]: array([0, 1, 2, 3])
In [70]: np.arange(4)                  # faster
Out[70]: array([0, 1, 2, 3])

arr too has basic object storage, with attributes like shape and dtype. It also has a data buffer, but for a numeric dtype like this, that buffer holds actual numeric values (8-byte integers), not pointers to Python integer objects.

In [71]: arr.nbytes
Out[71]: 32

That data buffer only takes 32 bytes - 4*8.

For this small example it's not surprising that getsizeof returns the same thing; the basic object storage outweighs the storage of the 4 values. It's when working with thousands of values, and with multidimensional arrays, that memory use differs significantly.
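For example, with 100,000 values the gap becomes obvious. This is only a sketch; the exact byte counts depend on the Python and NumPy versions, and getsizeof on an array includes the data buffer only when the array owns it:

import sys
import numpy as np

n = 100_000
a = np.arange(n)
l = list(range(n))

print(a.nbytes)                    # actual numeric data: n * 8 bytes
print(sys.getsizeof(a))            # array header plus the owned data buffer
print(sys.getsizeof(l))            # only the list's pointer buffer
print(sys.getsizeof(l) + sum(sys.getsizeof(x) for x in l))   # pointers + int objects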

But more important is calculation speed. With an array you can do things like arr + 1 or arr.sum(). These operate in compiled code and are quite fast. The equivalent list operations have to iterate at slow Python speeds, through the pointers, fetching values, etc. And doing that same sort of element-by-element Python iteration over an array is even slower.
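A rough timing sketch with timeit (the absolute numbers and exact ratios depend on the machine; they are not claimed by the original answer):

import timeit
import numpy as np

n = 100_000
arr = np.arange(n)
lst = list(range(n))

print(timeit.timeit(lambda: arr.sum(), number=1000))          # compiled loop over the buffer
print(timeit.timeit(lambda: sum(lst), number=1000))           # Python-level loop over pointers
print(timeit.timeit(lambda: sum(x for x in arr), number=10))  # Python loop over an array: slowest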

As a general rule, if you start with lists, and do list operations such as append and list comprehensions, it's best to stick with them.

But if you can create the arrays once, or from other arrays, and then use NumPy methods, you'll get 10x speed improvements. Arrays are indeed faster, but only if you use them the right way. They aren't a drop-in substitute for lists.
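As a rough illustration of that point (a sketch under my own assumptions, not part of the original answer): growing an array with np.append in a loop copies the whole array on every step, while appending to a list and converting once, or creating the array directly, is far cheaper.

import timeit
import numpy as np

def grow_array(n):
    a = np.empty(0, dtype=np.int64)
    for i in range(n):
        a = np.append(a, i)      # reallocates and copies the array every iteration
    return a

def grow_list(n):
    l = []
    for i in range(n):
        l.append(i)              # amortised O(1) per append
    return np.array(l)           # convert to an array once, at the end

print(timeit.timeit(lambda: grow_array(5_000), number=5))
print(timeit.timeit(lambda: grow_list(5_000), number=5))
print(timeit.timeit(lambda: np.arange(5_000), number=5))    # create once, vectorised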

A NumPy array keeps general information about the array in the object header (shape, data type, etc.) and stores all the values in one contiguous block of memory. A list, by contrast, allocates a new memory block for every object and stores a pointer to it. So when you iterate over a list, you are not iterating directly over the data in memory; you are following pointers. That is not efficient when you are working with large data. Here is an example:

import sys
import numpy as np

random_values_numpy = np.arange(1000)
random_values = list(range(1000))

# NumPy: 8 bytes per element, stored contiguously
print(random_values_numpy.itemsize)                              # bytes per element
print(random_values_numpy.size * random_values_numpy.itemsize)   # total data buffer (same as .nbytes)

# Python list: a pointer buffer plus a separate int object for every element
print(sys.getsizeof(random_values))                              # the pointer buffer only
print(sys.getsizeof(random_values) + sum(sys.getsizeof(v) for v in random_values))  # buffer + elements
