A list with sparse data consumes less memory than the same data as a numpy array

Question

I am working with very high dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I ran a quick test to see how much memory I could save using numpy (1)(3):

Standard list

import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 2594 MB

Just like I expected.

By allocating a contiguous block of memory with native floats, numpy only consumes about half of the memory the standard list is using.

Because I know my data is pretty sparse, I did the same test with sparse data.

Standard list

import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 529 MB

Now all of a sudden, the python list uses half the amount of memory the numpy array uses! Why?

One thing I could think of is that python dynamically switches to a dict representation when it detects that it contains very sparse data. Checking this could potentially add a lot of extra run-time overhead, so I don't really think that this is what is going on.

Notes

(1) I started a fresh new python shell for every test.
(2) Memory measured with htop.
(3) Run on 32-bit Debian.

Answer 1

A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
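A minimal sketch of the "array of references" point, assuming CPython 3 (where `sys.getsizeof` is available and `range` replaces `xrange`). The names `all_zero` and `all_distinct` are illustrative only. Two lists of the same length, built the same way, should report the same size for the list object itself, whatever their elements are:

```python
import sys

n = 100000

# Both lists are built by the same kind of comprehension, so they go
# through the same growth (over-allocation) steps.
all_zero = [0.0 for _ in range(n)]
all_distinct = [i + 0.5 for i in range(n)]

# sys.getsizeof reports the list object itself (header plus the array of
# references), not the objects those references point to.
size_zero = sys.getsizeof(all_zero)
size_distinct = sys.getsizeof(all_distinct)

print(size_zero, size_distinct)
print(type(all_zero), type(all_distinct))
```

The memory difference between a dense and a sparse list therefore comes from the float objects being referenced, not from the list changing its representation.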

In the sparse version of the list, you have a lot of pointers to a single float 0.0 object. Those pointers take up 32 bits = 4 bytes each, but your numpy floats are certainly larger, probably 64 bits.
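The arithmetic behind this can be sketched as follows. The 4-byte pointer and the 2**27 element count come from the question; the 8-byte numpy float and the 16-byte size of a Python float object on 32-bit CPython (8 bytes of object header plus an 8-byte double) are assumptions, and the handful of non-zero floats in the sparse list is ignored. The first part checks, on CPython, that every `0.0` produced by the comprehension is the same object:

```python
# Part 1: the literal 0.0 is a single constant, so every element of the
# list refers to the same float object (CPython implementation detail).
sparse = [0.0 for _ in range(1000)]
distinct_objects = len({id(x) for x in sparse})
print(distinct_objects)

# Part 2: back-of-the-envelope estimate for the sizes in the question.
N = 2 ** 27            # number of elements, from the question
POINTER_BYTES = 4      # 32-bit system, from the question's notes
FLOAT64_BYTES = 8      # assumed: numpy dtype=float is a 64-bit double
PYFLOAT_BYTES = 16     # assumed: Python float object on 32-bit CPython
MIB = 2 ** 20

numpy_array_mib = N * FLOAT64_BYTES / MIB                   # raw doubles
sparse_list_mib = N * POINTER_BYTES / MIB                   # pointers only
dense_list_mib = N * (POINTER_BYTES + PYFLOAT_BYTES) / MIB  # pointers + objects

print(numpy_array_mib, sparse_list_mib, dense_list_mib)
```

Under these assumptions the estimates come out at 1024, 512 and 2560, which is roughly in line with the 1054 MB, 529 MB and 2594 MB measured in the question.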

FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.
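As a rough illustration of the seeding advice, here is a sketch in Python 3 using only the standard library. The helper `sparse_values` and its parameters are hypothetical names introduced for this example; it uses a private `random.Random(seed)` generator instead of the module-level `random.seed`, which has the same effect on reproducibility:

```python
import random

def sparse_values(n, density, seed):
    """Return n floats, mostly 0.0, from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() if rng.random() < density else 0.0
            for _ in range(n)]

# Same seed, same data: both runs contain the same number of zeroes.
a = sparse_values(10000, 0.01, seed=0)
b = sparse_values(10000, 0.01, seed=0)
print(a == b, a.count(0.0), b.count(0.0))
```

The same seeded sequence could be fed to the numpy version of the test, so that both containers hold identical values.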

Answer 1 (accepted; 2 votes), 2015-04-14 13:23:48
