Pythonic way to create a numpy array from a list of numpy arrays

I generate a list of one-dimensional numpy arrays in a loop and later convert this list to a 2d numpy array. I would have preallocated a 2d numpy array if I knew the number of items ahead of time, but I don't, so I put everything in a list.

The mock-up is below:

>>> list_of_arrays = map(lambda x: x*ones(2), range(5))
>>> list_of_arrays
[array([ 0.,  0.]), array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.])]
>>> arr = array(list_of_arrays)
>>> arr
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])

My question is the following:

Is there a better way (performance-wise) to go about the task of collecting sequential numerical data (in my case numpy arrays) than putting it in a list and then making a numpy.array out of it (I am creating a new object and copying the data)? Is there an "expandable" matrix data structure available in a well-tested module?

A typical size of my 2d matrix would be between 100x10 and 5000x10 floats.

EDIT: In this example I'm using map, but in my actual application I have a for loop.
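For reference, a minimal sketch of the list-then-convert pattern with an explicit for loop; the row computation x * np.ones(10) here is just a stand-in for whatever produces each row:

import numpy as np

rows = []
for x in range(5):                 # loop bound not known in advance in the real application
    rows.append(x * np.ones(10))   # each iteration yields one 1-d array
arr = np.array(rows)               # one copy into a contiguous 2-d array at the end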

A convenient way, using numpy.concatenate. I believe it's also faster than @unutbu's answer:

In [32]: import numpy as np 

In [33]: list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))

In [34]: list_of_arrays
Out[34]: 
[array([ 0.,  0.]),
 array([ 1.,  1.]),
 array([ 2.,  2.]),
 array([ 3.,  3.]),
 array([ 4.,  4.])]

In [37]: shape = list(list_of_arrays[0].shape)

In [38]: shape
Out[38]: [2]

In [39]: shape[:0] = [len(list_of_arrays)]

In [40]: shape
Out[40]: [5, 2]

In [41]: arr = np.concatenate(list_of_arrays).reshape(shape)

In [42]: arr
Out[42]: 
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])
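
The slice assignment shape[:0] = [len(list_of_arrays)] simply prepends the row count to the per-row shape. For the 1-d rows here, the same result can be written more compactly (an equivalent rewrite, not part of the original answer):

arr = np.concatenate(list_of_arrays).reshape(len(list_of_arrays), -1)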

Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.

The tests below suggest that doing so will be slightly faster than constructing intermediate Python lists, no matter what the ultimate size of the array is.

Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.

This shows numpy_all_the_way is faster:

% python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
100 loops, best of 3: 1.78 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
100 loops, best of 3: 18.1 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
10 loops, best of 3: 90.4 msec per loop

% python -mtimeit -s"import test" "test.python_lists_to_array(100)"
1000 loops, best of 3: 1.97 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
10 loops, best of 3: 20.3 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
10 loops, best of 3: 101 msec per loop

This shows numpy_all_the_way uses less memory:

% test.py
Initial memory usage: 19788
After python_lists_to_array: 20976
After numpy_all_the_way: 20348

test.py:

import numpy as np
import os


def memory_usage():
    # Report this process's VmSize (in kB) from /proc; Linux-only.
    pid = os.getpid()
    return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                if line.startswith('VmSize')).split()[-2]

N, M = 5000, 10


def python_lists_to_array(k):
    list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
    arr = np.array(list_of_arrays)
    return arr


def numpy_all_the_way(k):
    arr = np.empty((N, M))  # pre-allocate at the known maximum size
    for x in range(k):
        arr[x] = x * np.ones(M)
    arr.resize((k, M))      # shrink to the size actually used
    return arr

if __name__ == '__main__':
    print('Initial memory usage: %s' % memory_usage())
    arr = python_lists_to_array(5000)
    print('After python_lists_to_array: %s' % memory_usage())
    arr = numpy_all_the_way(5000)
    print('After numpy_all_the_way: %s' % memory_usage())

Even simpler than @Gill Bates's answer, here is a one-line solution:

np.stack(list_of_arrays, axis=0)
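
A quick check with the arrays from the question; note that np.stack requires every input array to have the same shape, and axis=0 makes the new axis index the rows:

import numpy as np

list_of_arrays = [x * np.ones(2) for x in range(5)]
arr = np.stack(list_of_arrays, axis=0)   # shape (5, 2)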

What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only possibility of "holes" that I can think of comes from the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying the data, and then freeing the old memory. If you do that a lot, it is not very efficient.

One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.
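
A rough illustration of that space difference (exact sizes are CPython and numpy implementation details and will vary):

import sys
import numpy as np

n = 10000
py_list = list(range(n))
np_arr = np.arange(n)

# The list stores pointers to boxed Python int objects; the array stores raw values.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)
print(list_bytes, np_arr.nbytes)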

When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-d in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".
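
A minimal sketch of why the repeated-vstack pattern degrades (n and the row size are arbitrary here; absolute timings will vary):

import numpy as np
from timeit import timeit

n, m = 1000, 10
row = np.ones(m)

def repeated_vstack():
    arr = np.empty((0, m))
    for _ in range(n):
        arr = np.vstack((arr, row))   # copies every existing row each time: O(n^2) overall
    return arr

def list_then_array():
    rows = [row] * n                  # accumulating references is cheap
    return np.array(rows)             # one copy at the end: O(n) overall

print(timeit(repeated_vstack, number=10))
print(timeit(list_then_array, number=10))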

There was a thread on the numpy-discussion mailing list some time ago which discussed the possibility of adding a new numpy array type that allows efficient extending/appending. It seems there was significant interest in this at the time, although I don't know if anything came out of it. You might want to look at that thread.

I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be fine. That is how I create my numpy arrays when I don't know the number of elements in the array ahead of time.

I'll add my own version of @unutbu's answer. Similar to numpy_all_the_way, but you dynamically resize if you get an IndexError. I thought it would be a little faster for small data sets, but it's a little slower - the bounds checking slows things down too much. (In the sketch below, M matches the test setup above, and make_test_data is an assumed stand-in for the real data source, yielding k rows of length M.)

import numpy as np

M = 10                # row length, matching the test setup above
initial_guess = 1000  # starting capacity; doubled whenever it runs out

def make_test_data(k):
    # Assumed stand-in for the real data source: yields k rows of length M.
    for x in range(k):
        yield x * np.ones(M)

def my_numpy_all_the_way(k):
    arr = np.empty((initial_guess, M))
    for x, row in enumerate(make_test_data(k)):
        try:
            arr[x] = row
        except IndexError:
            # Out of pre-allocated rows: double the capacity and retry.
            arr.resize((arr.shape[0]*2, arr.shape[1]))
            arr[x] = row
    arr.resize((k, M))  # trim to the number of rows actually written
    return arr

Even simpler than @fnjn's answer:

np.vstack(list_of_arrays)
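
Assuming every element is a 1-d array of equal length, this produces the same (5, 2) result as the np.stack version (a quick sanity check, not part of the original answer):

import numpy as np

list_of_arrays = [x * np.ones(2) for x in range(5)]
assert np.array_equal(np.vstack(list_of_arrays), np.stack(list_of_arrays, axis=0))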
