
How to efficiently construct a numpy array from a large set of data?

If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause Python to make a copy of all the data, taking twice the space in memory? Should I instead convert the list of lists vector by vector, popping each one as I go?

# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)

Edit: Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
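A minimal sketch of the incremental approach asked about in the edit, assuming the final shape is known up front. The rows() generator is a hypothetical stand-in for wherever the data really comes from; only one short list exists at a time, so the full list of lists is never built.

```python
import numpy as np

def rows():
    """Hypothetical data source: yields one row of floats at a time."""
    for i in range(4):
        yield [float(i * 3 + j) for j in range(3)]

n_rows, n_cols = 4, 3  # assumed to be known in advance

# Allocate the whole block once; np.empty leaves the contents uninitialised.
arr = np.empty((n_rows, n_cols), dtype=np.float64)

# Each assignment converts one short list and writes it into the existing row.
for i, row in enumerate(rows()):
    arr[i] = row

print(arr.shape, arr.dtype)
```

Peak memory here is roughly the array itself plus a single row, rather than the array plus the entire nested list.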

I'm just putting this here as it's a bit long for a comment.

Have you read the numpy documentation for array?

numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
  If true (default), then the object is copied. Otherwise, a copy will
  only be made if __array__ returns a copy, if obj is a nested sequence,
  or if a copy is needed to satisfy any of the other requirements (dtype,
  order, etc.).
...
"""

When you say you don't want to copy the data of the original list when creating the numpy array, what data structure are you hoping to end up with?

A lot of the speed-up you get from using numpy comes from the fact that the C arrays it creates are contiguous in memory. A list in Python is just an array of pointers to objects, so you have to go and find each object every time it is accessed. That isn't the case in numpy, which stores the values themselves in one block and is implemented in C rather than Python.

If you just want the numpy array to reference the Python lists in your 2D list, then you'll lose the performance gains.

If you do np.array(my_2D_python_array, copy=False) I don't know what it will actually produce, but you could easily test it yourself. Look at the shape of the array, and see what kind of objects it houses.

If you want the numpy array to be contiguous though, at some point you're going to have to allocate all of the memory it needs (and if it's as large as you're suggesting, it might be difficult to find a contiguous section large enough).

Sorry, that was pretty rambling, just a comment. How big are the actual arrays you're looking at?

Here's a plot of the CPU usage and memory usage of a small sample program:
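The test suggested above could look something like this small sketch. The nested list is made-up sample data, and the try/except is an assumption to keep it running across versions: NumPy 2.x raises ValueError for copy=False when a copy cannot be avoided, whereas older releases silently copy.

```python
import numpy as np

nested = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # made-up sample data

try:
    arr = np.array(nested, copy=False)
except ValueError:
    # NumPy 2.x refuses copy=False when the data has to be copied.
    arr = np.asarray(nested)

# Inspect what was actually produced.
print(arr.shape, arr.dtype)

# Changing the list afterwards does not change the array,
# which shows the values were copied out of the Python objects.
nested[0][0] = 99.0
print(arr[0, 0])
```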

# Make a large Python 2D list
N, M = 10000, 18750
print("%i x %i = %i doubles = %f GB" % (N, M, N * M, N * M * 8 / 10**9))

# Grab the pid to monitor memory and CPU usage
import os
pid = os.getpid()

# moniter.py is the monitoring script used for the plot (not shown here)
os.system("python moniter.py -p " + str(pid) + " &")


print("building python matrix")
large_2d_array = [[n + m*M for n in range(N)] for m in range(M)]


import numpy
from datetime import datetime

print(datetime.now(), "creating numpy array with copy")
np1 = numpy.array(large_2d_array, copy=True)
print(datetime.now(), "deleting array")
del np1


# Note: NumPy 2.0+ raises ValueError for copy=False when a copy is unavoidable
print(datetime.now(), "creating numpy array without copy")
np1 = numpy.array(large_2d_array, copy=False)
print(datetime.now(), "deleting array")
del np1

[Image: plot of CPU and memory usage of the sample program over time, with points 1, 2 and 3 marked]

1, 2, and 3 are the points where each of the matrices finishes being created. Note that the native Python list takes up much more memory than the numpy arrays: Python objects each have their own overhead, and the lists are lists of pointers to objects. For the numpy array this is not the case, so it is considerably smaller.
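The size difference can be estimated on a much smaller scale with a sketch like the one below. The dimensions are arbitrary, and the sys.getsizeof sum is only a rough estimate for CPython (it would count shared objects more than once, and ignores allocator overhead).

```python
import sys
import numpy as np

n_rows, n_cols = 200, 500  # arbitrary small dimensions for illustration
nested = [[float(r * n_cols + c) for c in range(n_cols)] for r in range(n_rows)]

# Rough size of the nested list: outer list + each inner list + each float object.
list_bytes = sys.getsizeof(nested)
for row in nested:
    list_bytes += sys.getsizeof(row)
    list_bytes += sum(sys.getsizeof(x) for x in row)

arr = np.array(nested)

# arr.nbytes is the size of the contiguous data buffer: 8 bytes per float64.
print("list of lists: ~%d bytes" % list_bytes)
print("numpy array:    %d bytes" % arr.nbytes)
```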

Also note that the copy flag has no effect when the input is a Python list: new data is always created. You could get around this by creating a numpy array of Python objects (using dtype=object), but I wouldn't advise it.
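For illustration only, here is a sketch of what the dtype=object route looks like, using made-up sample values. The object array holds references to the original Python floats instead of copying their values, which is also why it gives up the contiguous numeric storage that makes numpy fast.

```python
import numpy as np

# Made-up sample data; float(...) is used so each value is a distinct object.
nested = [[float("1.5"), float("2.5")], [float("3.5"), float("4.5")]]

obj_arr = np.array(nested, dtype=object)  # stores pointers to Python objects
num_arr = np.array(nested)                # stores raw float64 values

print(obj_arr.dtype, obj_arr.shape)
print(num_arr.dtype, num_arr.shape)

# The object array element is the very same Python float held by the list.
same_object = obj_arr[0, 0] is nested[0][0]
print(same_object)
```

The pointers themselves are still copied into a new buffer, so even this does not make the array a pure view of the list.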
