
Difference between list and NumPy array memory size

I've heard that NumPy arrays are more efficient than Python's built-in lists and that they take less space in memory. As I understand it, NumPy stores its values next to each other in memory, while Python's list implementation stores 8-byte pointers to the values. However, when I test this in a Jupyter notebook, both objects turn out to have the same size.

import numpy as np
from sys import getsizeof
array = np.array([_ for _ in range(4)])
getsizeof(array), array

This returns (128, array([0, 1, 2, 3])), the same size as:

l = list([_ for _ in range(4)])
getsizeof(l), l

which gives (128, [0, 1, 2, 3]).

Can you provide a clear example of how I can show the difference in a Jupyter notebook?

getsizeof is not a good measure of memory use, especially with lists. As you note, the list has a buffer of pointers to objects elsewhere in memory. getsizeof reports the size of that buffer, but tells us nothing about the objects it points to.

With

In [66]: list(range(4))
Out[66]: [0, 1, 2, 3]

the list has its basic object storage, plus the buffer with 4 pointers (plus some growth room). The numbers are stored elsewhere. In this case the numbers are small, and have already been created and cached by the interpreter, so their storage doesn't add anything. But larger numbers (and floats) are created with each use, and take up space. Also, a list can contain anything, such as pointers to other lists, strings, dicts, or whatever.
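To see the difference between the container and what it points to, here is a minimal sketch. The helper list_total_size is a hypothetical name of my own, not a standard function, and it only looks one level deep. It counts every pointed-to object, including cached small integers that exist whether or not the list does, so treat its result as an upper bound. Exact byte counts depend on the Python build.

```python
import sys

def list_total_size(lst):
    """Size of the list object plus each distinct element it points to.

    Hypothetical helper for illustration; only one level deep.
    """
    seen = set()
    total = sys.getsizeof(lst)          # list header + pointer buffer
    for item in lst:
        if id(item) not in seen:        # count a shared object only once
            seen.add(id(item))
            total += sys.getsizeof(item)
    return total

small = [i for i in range(4)]           # small ints, cached by the interpreter
big = [10**20 + i for i in range(4)]    # large ints, each a separate, bigger object

# getsizeof sees only the container, so both lists report the same size
print(sys.getsizeof(small), sys.getsizeof(big))
# adding the elements shows that the large integers cost more
print(list_total_size(small), list_total_size(big))
```

The first line prints two equal numbers; the second shows the gap that getsizeof alone hides.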

In [67]: arr = np.array([i for i in range(4)])   # via list
In [68]: arr
Out[68]: array([0, 1, 2, 3])
In [69]: np.array(range(4))            # more direct
Out[69]: array([0, 1, 2, 3])
In [70]: np.arange(4)
Out[70]: array([0, 1, 2, 3])           # faster

arr too has basic object storage, with attributes like shape and dtype. It too has a data buffer, but for a numeric dtype like this, that buffer holds the actual numeric values (8-byte integers), not pointers to Python integer objects.

In [71]: arr.nbytes
Out[71]: 32

That data buffer takes only 32 bytes: 4 values * 8 bytes each.

For this small example it's not surprising that getsizeof returns the same thing. The basic object storage outweighs the storage for the 4 values. It's when working with thousands of values, and with multidimensional arrays, that memory use is significantly different.
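A rough way to show this in a notebook is to repeat the measurement with many more values. This is only a sketch: the size n and the explicit int64 dtype are my choices, and the exact numbers assume a typical 64-bit CPython build.

```python
import sys
import numpy as np

n = 100_000
arr = np.arange(n, dtype=np.int64)
lst = list(range(n))

array_bytes = arr.nbytes            # raw data buffer: n * 8 bytes
array_total = sys.getsizeof(arr)    # array header + buffer (this array owns its data)
list_buffer = sys.getsizeof(lst)    # list header + n pointers
# add the integer objects the pointers refer to
list_total = list_buffer + sum(sys.getsizeof(v) for v in lst)

print("array data buffer:  ", array_bytes)
print("array total:        ", array_total)
print("list pointer buffer:", list_buffer)
print("list + int objects: ", list_total)
```

The list's pointer buffer alone is about as large as the whole array; the integer objects come on top of that.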

But more important is calculation speed. With an array you can do things like arr+1 or arr.sum(). These operate in compiled code, and are quite fast. Similar list operations have to iterate, at slow Python speeds, through the pointers, fetching values, etc. But doing the same sort of iteration on arrays is even slower.
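A quick timing sketch of those three cases, using the standard timeit module. The size and repeat count are arbitrary choices of mine, and the timings will vary by machine, so no particular numbers are claimed here.

```python
import timeit
import numpy as np

n = 100_000
lst = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Python-level loop over a list
t_list = timeit.timeit(lambda: sum(lst), number=10)
# compiled NumPy reduction
t_arr = timeit.timeit(lambda: arr.sum(), number=10)
# Python-level loop over an array: each element is boxed into a NumPy scalar
t_arr_iter = timeit.timeit(lambda: sum(arr), number=10)

print(f"sum(lst):  {t_list:.4f} s")
print(f"arr.sum(): {t_arr:.4f} s")
print(f"sum(arr):  {t_arr_iter:.4f} s")
```

On a typical setup I would expect arr.sum() to be the fastest and sum(arr) the slowest, but measure on your own machine.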

As a general rule, if you start with lists, and do list operations such as append and list comprehensions, it's best to stick with them.

But if you can create the arrays once, or from other arrays, and then use numpy methods, you can get speed improvements on the order of 10x. Arrays are indeed faster, but only if you use them in the right way. They aren't a simple drop-in substitute for lists.
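As an illustration of "the right way", here is a small sketch contrasting two ways of building the same array. The variable names and values are made up for the example.

```python
import numpy as np

# Pattern 1: grow a Python list, then convert to an array once at the end
values = []
for i in range(1000):
    values.append(i * 0.5)
arr_from_list = np.array(values)

# Pattern 2: treat the array like a list; np.append returns a new array
# each time, so every iteration copies all of the existing data
arr_grown = np.empty(0)
for i in range(1000):
    arr_grown = np.append(arr_grown, i * 0.5)

# Once the data is in an array, use whole-array operations
scaled = arr_from_list * 2 + 1

print(np.array_equal(arr_from_list, arr_grown))
print(scaled[:4])
```

Both patterns end with the same array, but the second does far more copying along the way, which is why lists are the better tool for the append-heavy stage.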

A NumPy array keeps general information about the array (shape, data type, etc.) in the array object's header, and stores all the values in one contiguous block of memory. A list, by contrast, stores only pointers; each element is a separate object allocated elsewhere in memory. So when you iterate over a list, you are not walking directly over the values in memory, you are following pointers. That becomes costly when you are working with large data. Here is an example:

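import sys
import numpy as np

random_values_numpy = np.arange(1000)
random_values = list(range(1000))  # a real list; a bare range() is a small lazy object

# NumPy: bytes per element, then total bytes in the data buffer
print(random_values_numpy.itemsize)
print(random_values_numpy.size * random_values_numpy.itemsize)

# Python list: the pointer buffer alone, then the buffer plus the int objects it points to
print(sys.getsizeof(random_values))
print(sys.getsizeof(random_values) + sum(sys.getsizeof(v) for v in random_values))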
