
numpy array size vs. speed of concatenation

I am concatenating data to a numpy array like this:

xdata_test = np.concatenate((xdata_test,additional_X))

This is done a thousand times. The arrays have dtype float32, and their sizes are shown below:

xdata_test.shape   :  (x1,40,24,24)        (x1 : [500~10500])   
additional_X.shape :  (x2,40,24,24)        (x2 : [0 ~ 500])

The problem is that when x1 is larger than ~2000-3000, the concatenation takes a lot longer.

The graph below plots the concatenation time versus the size of the x2 dimension:

[figure: x2 vs. time consumed]

Is this a memory issue or a basic characteristic of numpy?

As far as I understand numpy, none of the stack and concatenate functions is extremely efficient. And for good reason: numpy tries to keep array memory contiguous for efficiency (see this link about contiguous arrays in numpy).
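
For instance (a tiny illustration with made-up shapes), np.shares_memory shows that concatenate hands back a freshly allocated contiguous buffer rather than a view on its inputs:

import numpy as np

a = np.zeros((1000, 40, 24, 24), dtype=np.float32)
b = np.zeros((100, 40, 24, 24), dtype=np.float32)

c = np.concatenate((a, b))
print(np.shares_memory(a, c))   # False: a's data was copied into c
print(c.flags['C_CONTIGUOUS'])  # True: c is one contiguous block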

That means that every concatenate operation has to copy the whole data. When I need to concatenate a bunch of elements together, I tend to do this:

l = []
for additional_X in ...:
    l.append(additional_X)      # collect the chunks first
xdata_test = np.concatenate(l)  # move the whole data only once

That way, the costly operation of moving the whole data is only done once.

NB: I would be interested in the speed improvement this gives you.
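
As a rough illustration, here is a hypothetical micro-benchmark of the two patterns (chunk counts and sizes are invented; absolute timings will vary by machine):

import time
import numpy as np

chunks = [np.ones((20, 40, 24, 24), dtype=np.float32) for _ in range(100)]

# repeated concatenation: every iteration re-copies everything accumulated so far
t0 = time.perf_counter()
out = chunks[0]
for c in chunks[1:]:
    out = np.concatenate((out, c))
t_repeated = time.perf_counter() - t0

# collect in a list, concatenate once: the data is moved a single time
t0 = time.perf_counter()
out2 = np.concatenate(chunks)
t_once = time.perf_counter() - t0

print(f"repeated: {t_repeated:.3f}s  single: {t_once:.3f}s")

The repeated variant should scale roughly quadratically with the number of chunks, which matches the slowdown described in the question.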

If you have the arrays you want to concatenate in advance, I would suggest creating a new array with the total shape and filling it with the small arrays rather than concatenating, as every concatenation operation needs to copy the whole data to a new contiguous block of memory.

  • First, calculate the total size of the first axis:

     max_x = 0
     for arr in list_of_arrays:
         max_x += arr.shape[0]
  • Second, create the end container:

     final_data = np.empty((max_x,) + xdata_test.shape[1:], dtype=xdata_test.dtype) 

    which is equivalent to (max_x, 40, 24, 24), but with the shape and dtype taken dynamically from xdata_test.

  • Last, fill the numpy array:

     curr_x = 0
     for arr in list_of_arrays:
         final_data[curr_x:curr_x + arr.shape[0]] = arr
         curr_x += arr.shape[0]

The loop above copies each of the arrays into its previously computed rows of the larger array.

By doing this, each of the N arrays is copied exactly once, directly to its final destination, rather than creating a temporary array for each concatenation. A consolidated version of the recipe is sketched below.
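
Putting the three steps together, here is a minimal self-contained sketch (list_of_arrays and the chunk sizes are hypothetical stand-ins for the additional_X batches from the question):

import numpy as np

# hypothetical chunks standing in for the additional_X batches
list_of_arrays = [np.random.rand(x2, 40, 24, 24).astype(np.float32)
                  for x2 in (100, 250, 500)]

# 1) total size of the first axis
max_x = sum(arr.shape[0] for arr in list_of_arrays)

# 2) create the end container once
final_data = np.empty((max_x,) + list_of_arrays[0].shape[1:],
                      dtype=list_of_arrays[0].dtype)

# 3) copy each chunk directly into its final rows
curr_x = 0
for arr in list_of_arrays:
    final_data[curr_x:curr_x + arr.shape[0]] = arr
    curr_x += arr.shape[0]

assert final_data.shape == (850, 40, 24, 24)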
