
List with sparse data consumes less memory than the same data as numpy array

I am working with very high-dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I ran a quick test to see how much memory I could save using numpy (1)(3):

Standard list

import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 2594 MB

Just as I expected.

By allocating a contiguous block of memory for native floats, numpy consumes only about half of the memory the standard list uses.
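A rough way to cross-check those htop numbers from inside the process is shown below (a sketch; note that sys.getsizeof on a list counts only the list object and its pointer array, not the float objects the pointers refer to):

import sys
import numpy

n = 10 ** 6  # smaller n, but the per-element costs are the same
lst = [float(i) for i in xrange(n)]
arr = numpy.arange(n, dtype=float)

print(arr.nbytes)          # raw buffer: n * 8 bytes of native float64
print(sys.getsizeof(lst))  # list object + pointer array only
# Approximate total for the list: pointer array plus one float object
# per element (ignores allocator overhead):
print(sys.getsizeof(lst) + n * sys.getsizeof(0.0))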

Because I know my data is pretty sparse, I ran the same test with sparse data.

Standard list

import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 529 MB

Now, all of a sudden, the Python list uses half the amount of memory the numpy array uses! Why?

One thing I could think of is that Python dynamically switches to a dict representation when it detects that a list contains very sparse data. But checking for this would add a lot of extra run-time overhead, so I don't really think that this is what's going on. A quick probe of this hypothesis is sketched below.
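Since sys.getsizeof reports the size of the list object itself (i.e. its pointer array), a representation switch should show up as a size difference between a dense and a sparse list of the same length, built the same way. It doesn't (a minimal check):

import sys
import random

dense = [random.random() for i in xrange(1000)]
sparse = [0.0 for i in xrange(1000)]
# Same length, same construction, same reported size -- the list stays
# a plain pointer array regardless of its contents:
print(sys.getsizeof(dense) == sys.getsizeof(sparse))  # True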

Notes

  1. I started a fresh new python shell for every test.
  2. Memory measured with htop.
  3. Run on 32-bit Debian.

A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation

In the sparse version of the list, you have a lot of pointers to a single float 0.0 object. Those pointers take up 32 bits = 4 bytes each on your 32-bit build, but your numpy floats are larger: dtype=float means float64, i.e. 8 bytes per element.
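This sharing is easy to verify: in CPython, the 0.0 literal in the comprehension is compiled as a single constant, so every zero slot in the list points at the same float object (a minimal check):

import random

random.seed(0)
sparse = [random.random() if random.random() < 0.00001 else 0.0
          for i in xrange(1000)]
zeros = [x for x in sparse if x == 0.0]
# Identity, not just equality: all zeros are the same object, so each
# zero entry costs only one 4-byte pointer on a 32-bit build:
print(all(z is zeros[0] for z in zeros))  # True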

FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.
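As a final sanity check, the observed numbers line up well with back-of-envelope arithmetic, assuming 4-byte pointers, 8-byte numpy float64 values, and roughly 16 bytes per Python float object on a 32-bit build:

n = 2 ** 27
MB = 2 ** 20
print(n * 8 / MB)         # numpy float64 buffer: 1024 MB (observed ~1054 MB)
print(n * 4 / MB)         # sparse list, pointers only: 512 MB (observed ~529 MB)
print(n * (4 + 16) / MB)  # dense list, pointer + float object per element:
                          #   2560 MB (observed ~2594 MB)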
