
How much overhead does python numpy tolist() add?

I am using a Python program that uses numpy arrays as the standard data type for arrays. For the heavy computation I pass the arrays to a C++ library. In order to do so, I use pybind. However, I am required to pass Python lists, so I do the conversion from numpy array to list via:

NativeSolver.vector_add(array1.tolist(), array2.tolist(), ...)

How much overhead does this conversion generate? I hope it doesn't create a whole new copy. Numpy reference says:

ndarray.tolist()

Return a copy of the array data as a (nested) Python list. Data items are converted to the nearest compatible Python type.
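To make the "converted to the nearest compatible Python type" part concrete, here is a small sketch (not from the question, just an illustration of what the conversion does per element):

import numpy as np

a = np.arange(3, dtype=np.int32)
print(type(a[0]))           # <class 'numpy.int32'> - a numpy scalar backed by the array's C buffer
print(type(a.tolist()[0]))  # <class 'int'> - a freshly created, full Python int object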

A lot. For simple built-in types, you can use sys.getsizeof on an object to determine the memory overhead associated with that object (for containers, this does not include the values stored in them, only the pointers used to store them).

So for example, a list of 100 smallish ints (but greater than 256, to avoid the small-int cache) costs (on my Python 3.5.1 Windows x64 install):

>>> import sys
>>> sys.getsizeof([0] * 100) + sys.getsizeof(0) * 100
3264

or about 3 KB of memory required. If those same values were stored in a numpy array of int32s, with no Python object per number and no per-object pointers, the size would drop to roughly 100 * 4 bytes (plus another few dozen bytes for the array object overhead itself), somewhere under 500 bytes. The incremental cost for each additional small int is 24 bytes for the object (though it's free if it's in the small-int cache, for values from -5 to 256 IIRC), plus 8 bytes for the pointer stored in the list, 32 bytes total, vs. 4 bytes for the C-level type: roughly 8x the storage requirement (and you're still storing the original object too).
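A minimal sketch of that comparison (exact byte counts vary with the Python and numpy versions, but the ratio comes out similar):

import sys
import numpy as np

values = list(range(257, 357))     # 100 smallish ints, outside the small-int cache

# list cost: the pointer buffer plus one full Python int object per element
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# numpy cost: one contiguous buffer of 4-byte int32s (the array object header adds a bit more)
arr = np.array(values, dtype=np.int32)
array_bytes = arr.nbytes           # 400 bytes of element storage

print(list_bytes, array_bytes)     # roughly 3.5 KB vs. 400 bytes on 64-bit CPython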

If you have enough memory to deal with it, so be it. But otherwise, you might try looking at a wrapper that lets you pass in buffer-protocol-supporting objects (numpy.array, array.array on Py3, ctypes arrays populated via memoryview slice assignment, etc.) so conversion to Python-level types isn't needed.
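For example, the "ctypes array populated via memoryview slice assignment" option might look like this (a sketch; the wrapper would then accept the ctypes array through the buffer protocol instead of a list):

import ctypes
import numpy as np

src = np.arange(100, dtype=np.float64)      # the data already sits in one contiguous C buffer

# A ctypes array also exports the buffer protocol; fill it with a raw byte copy
# via memoryview slice assignment instead of building 100 Python float objects.
dst = (ctypes.c_double * len(src))()
memoryview(dst).cast('B')[:] = memoryview(src).cast('B')

print(dst[0], dst[99])                      # 0.0 99.0: same values, no intermediate list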

Yes, it will be a new copy. The data layout of an array is very different from that of a list.

An array has attributes like shape and strides, and a 1d data buffer that contains the elements: just a contiguous set of bytes. It's the other attributes, and the code that uses them, that treat those bytes as floats, ints, strings, 1d, 2d, etc.

A list is a buffer of pointers, with each pointer pointing to an object elsewhere in memory. It may point to a number, a string, or another list. It is not going to point into the array's data buffer or at elements in it.
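A quick way to see that difference (a sketch, just inspecting the attributes mentioned above):

import numpy as np

a = np.arange(6, dtype=np.int32).reshape(2, 3)
print(a.shape, a.strides, a.nbytes)    # (2, 3) (12, 4) 24: one 24-byte contiguous buffer
print(a.__array_interface__['data'])   # (address, read-only flag) of that single buffer

lst = a.tolist()                       # [[0, 1, 2], [3, 4, 5]]
print([id(x) for x in lst[0]])         # three distinct Python int objects, none of them inside a's buffer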

There are ways of interfacing numpy arrays with compiled code and C arrays that make use of the array's data buffer; cython is a common one. There is also a whole documentation section on the numpy C API. I don't know anything about pybind. If it requires a list interface, it may not be the best choice.

When I've done timeit tests with tolist() it hasn't appeared to be that expensive.
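For reference, a timeit sketch along those lines (absolute numbers depend on the machine and array size; the point is only to compare tolist() against a plain C-level buffer copy):

import timeit

setup = "import numpy as np; a = np.arange(10000, dtype=np.float64)"

# tolist() builds 10,000 Python float objects plus the list of pointers to them
print(timeit.timeit("a.tolist()", setup=setup, number=1000))

# a.copy() duplicates the same 80 KB buffer with no Python objects, for comparison
print(timeit.timeit("a.copy()", setup=setup, number=1000))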

=======================

But looking at the pybind11 GitHub repository I find a number of references to numpy, and this

http://pybind11.readthedocs.io/en/latest/advanced.html#numpy-support

documentation page. It supports the buffer protocol and numpy arrays, so you shouldn't have to go through the tolist step:

#include <pybind11/numpy.h>
namespace py = pybind11;  // the py:: alias used below
// f accepts numpy arrays (and other buffer-protocol objects) directly, no list needed
void f(py::array_t<double> array);
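On the Python side that means the question's call can take the arrays as-is. Assuming the C++ code above is built into a module, it would look something like this (the module name native_solver and the function vector_add are hypothetical, mirroring the question's call):

import numpy as np
import native_solver                  # hypothetical: the pybind11 module built from the C++ above

array1 = np.arange(1000, dtype=np.float64)
array2 = np.ones(1000, dtype=np.float64)

# A py::array_t<double> parameter accepts the numpy arrays directly: no .tolist() copies
result = native_solver.vector_add(array1, array2)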
