Fastest way to grow a numpy numeric array

Requirements:

  • I need to grow an array arbitrarily large from data.
  • I can guess the size (roughly 100-200), with no guarantee that the array will fit every time.
  • Once it is grown to its final size, I need to perform numeric computations on it, so I'd prefer to eventually get to a 2-D numpy array.
  • Speed is critical. As an example, for one of 300 files, the update() method is called 45 million times (takes 150 s or so) and the finalize() method is called 500k times (takes a total of 106 s), for a total of 250 s or so.

Here is my code:

def __init__(self):
    self.data = []

def update(self, row):
    self.data.append(row)

def finalize(self):
    dx = np.array(self.data)

Other things I tried include the following code, but this is way slower.

class A:
    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        # np.append returns a new array, so the result must be reassigned
        self.data = np.append(self.data, row)

    def finalize(self):
        dx = np.reshape(self.data, (self.data.shape[0] // 5, 5))

Here is a schematic of how this is called:

for i in range(500000):
    ax = A()
    for j in range(200):
        ax.update([1,2,3,4,5])
    ax.finalize()
    # some processing on ax

I tried a few different things, with timing.

import numpy as np

  1. The method you mention as slow: (32.094 seconds)

     class A:
         def __init__(self):
             self.data = np.array([])

         def update(self, row):
             self.data = np.append(self.data, row)

         def finalize(self):
             return np.reshape(self.data, (self.data.shape[0] // 5, 5))

  2. Regular ol' Python list: (0.308 seconds)

     class B:
         def __init__(self):
             self.data = []

         def update(self, row):
             for r in row:
                 self.data.append(r)

         def finalize(self):
             return np.reshape(self.data, (len(self.data) // 5, 5))

  3. Trying to implement an arraylist in numpy: (0.362 seconds)

     class C:
         def __init__(self):
             self.data = np.zeros((100,))
             self.capacity = 100
             self.size = 0

         def update(self, row):
             for r in row:
                 self.add(r)

         def add(self, x):
             if self.size == self.capacity:
                 # Buffer is full: allocate 4x the capacity and copy the old data over
                 self.capacity *= 4
                 newdata = np.zeros((self.capacity,))
                 newdata[:self.size] = self.data
                 self.data = newdata

             self.data[self.size] = x
             self.size += 1

         def finalize(self):
             data = self.data[:self.size]
             return np.reshape(data, (len(data) // 5, 5))

And this is how I timed it:

x = C()
for i in range(100000):
    x.update([i])

So it looks like regular old Python lists are pretty good ;)

np.append() copies all the data in the array every time, whereas a list grows its capacity by a factor (1.125). A list is fast, but its memory usage is larger than an array's. You can use the array module of the Python standard library if you care about memory.

Here is a discussion about this topic:

How to create a dynamic array
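As a rough illustration of the array-module suggestion, here is a minimal sketch. It assumes rows of five float values, as in the question; the variable names are made up for the example. The standard-library array stores raw C doubles, and np.frombuffer can wrap its memory without converting element by element.

```python
from array import array

import numpy as np

# Typecode 'd' stores C doubles, so values are kept unboxed (8 bytes each)
buf = array('d')
for _ in range(200):
    buf.extend([1.0, 2.0, 3.0, 4.0, 5.0])

# np.frombuffer wraps the existing memory; copy() detaches the result so
# that buf can keep growing afterwards (assumption: rows have length 5)
dx = np.frombuffer(buf, dtype=np.float64).reshape(-1, 5).copy()
print(dx.shape)
```

The copy() is a deliberate choice: while a numpy view of the buffer is alive, the array cannot be resized, so taking a copy keeps the growing buffer and the finished matrix independent.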

Using the class declarations in Owen's post, here is a revised timing that also captures some of the effect of finalize.

In short, I find class C to provide an implementation that is over 60x faster than the method in the original post. (Apologies for the wall of text.)

The file I used:

#!/usr/bin/python
import cProfile
import numpy as np

# ... class declarations here ...

def test_class(f):
    x = f()
    for i in range(100000):
        x.update([i])
    for i in range(1000):
        x.finalize()

for x in 'ABC':
    cProfile.run('test_class(%s)' % x)

Now, the resulting timings:

A:

     903005 function calls in 16.049 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000   16.049   16.049 <string>:1(<module>)
100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
  1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
  1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
     1    0.146    0.146   16.049   16.049 test.py:50(test_class)
     1    0.000    0.000    0.000    0.000 test.py:6(__init__)
100000    1.475    0.000   15.899    0.000 test.py:9(update)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}

B:

     208004 function calls in 16.885 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.001    0.001   16.885   16.885 <string>:1(<module>)
  1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
  1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
  1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
     1    0.000    0.000    0.000    0.000 test.py:16(__init__)
100000    0.068    0.000    0.080    0.000 test.py:19(update)
  1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
     1    0.284    0.284   16.883   16.883 test.py:50(test_class)
  1000    0.005    0.000    0.005    0.000 {getattr}
  1000    0.001    0.000    0.001    0.000 {len}
100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
  1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}

C:

     204010 function calls in 0.244 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.244    0.244 <string>:1(<module>)
  1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
     1    0.000    0.000    0.000    0.000 test.py:27(__init__)
100000    0.082    0.000    0.170    0.000 test.py:32(update)
100000    0.087    0.000    0.088    0.000 test.py:36(add)
  1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
     1    0.068    0.068    0.243    0.243 test.py:50(test_class)
  1000    0.000    0.000    0.000    0.000 {len}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
     6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

Class A is destroyed by the updates, class B is destroyed by the finalizes. Class C is robust in the face of both of them.
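To see why the updates hurt class A so much, here is a back-of-the-envelope count of how many elements get copied while growing to 100,000 entries. It is a model, not a measurement: it assumes np.append copies every existing element on each call, and that class C starts at capacity 100 and multiplies it by 4 when full, as in Owen's code. The helper names are invented for this illustration.

```python
def copies_with_append(n):
    # Model: the k-th append copies the k elements already stored
    return sum(range(n))


def copies_with_growth(n, capacity=100, factor=4):
    # Model: elements are copied only when the buffer is full and reallocated
    copied = 0
    size = 0
    for _ in range(n):
        if size == capacity:
            copied += size
            capacity *= factor
        size += 1
    return copied


print(copies_with_append(100000))   # 4999950000
print(copies_with_growth(100000))   # 34100
```

Under this model class C resizes only five times (at 100, 400, 1600, 6400 and 25600 elements), which together with the initial allocation matches the 6 calls to zeros in the profile above.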

There is a big performance difference depending on the function that you use for finalization. Consider the following code:

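from time import time

import numpy as np

N = 100000
nruns = 5

# Build a list of N rows, each a 1-D array of length 1000
a = []
for i in range(N):
    a.append(np.zeros(1000))

print("start")

# Variant 1: np.vstack on the list of rows
b = []
for i in range(nruns):
    s = time()
    c = np.vstack(a)
    b.append(time() - s)
print("Timing version vstack ", np.mean(b))

# Variant 2: np.reshape directly on the list of rows
b = []
for i in range(nruns):
    s = time()
    c1 = np.reshape(a, (N, 1000))
    b.append(time() - s)

print("Timing version reshape ", np.mean(b))

# Variant 3: concatenate into one flat array, then reshape
b = []
for i in range(nruns):
    s = time()
    c2 = np.concatenate(a, axis=0).reshape(-1, 1000)
    b.append(time() - s)

print("Timing version concatenate ", np.mean(b))

print(c.shape, c2.shape)
assert (c == c2).all()
assert (c == c1).all()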

Using concatenate seems to be about twice as fast as the first version and more than 10 times faster than the second version.

Timing version vstack  1.5774928093
Timing version reshape  9.67419199944
Timing version concatenate  0.669512557983

If you want to improve performance with list operations, have a look at the blist library. It is an optimized implementation of the Python list and other structures.

I haven't benchmarked it yet, but the results on their page seem promising.

Multidimensional numpy arrays

Adding to Owen's and Prashant Kumar's posts: a version using multidimensional numpy arrays (i.e. arrays with a shape) speeds up the code for the numpy solutions, especially if you need to access the data (via finalize()) often.

Version                 Prashant Kumar   row_length=1   row_length=5
Class A - np.append     2.873 s          2.776 s        0.682 s
Class B - python list   6.693 s          80.868 s       22.012 s
Class C - arraylist     0.095 s          0.180 s        0.043 s

The column Prashant Kumar is his example executed on my machine, to give a comparison. The case row_length=5 corresponds to the example in the initial question. The dramatic increase for the python list comes from {built-in method numpy.array}, which means numpy needs a lot more time to convert a multidimensional list of lists to an array than to convert a 1D list and reshape it, even though both have the same number of entries, e.g. np.array([[1,2,3]]*5) vs. np.array([1]*15).reshape((-1,3)).

And this is the code:

import cProfile
import numpy as np

class A:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.data = np.array([], dtype=dtype).reshape((0, *shape[1:]))

    def update(self, row):
        # axis=0 appends the row as a new first-axis entry instead of flattening
        self.data = np.append(self.data, [row], axis=0)

    def finalize(self):
        return self.data


class B:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.shape = shape
        self.dtype = dtype
        self.data = []

    def update(self, row):
        self.data.append(row)

    def finalize(self):
        return np.array(self.data, dtype=self.dtype).reshape((-1, *self.shape[1:]))


class C:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.shape = shape
        self.data = np.zeros((100, *shape[1:]), dtype=dtype)
        self.capacity = 100
        self.size = 0

    def update(self, x):
        if self.size == self.capacity:
            # Buffer is full: grow by 4x, keeping the row shape and dtype
            self.capacity *= 4
            newdata = np.zeros((self.capacity, *self.data.shape[1:]),
                               dtype=self.data.dtype)
            newdata[:self.size] = self.data
            self.data = newdata

        self.data[self.size] = x
        self.size += 1

    def finalize(self):
        # A slice is a view, so no data is copied here
        return self.data[:self.size]


def test_class(f):
    row_length = 5
    x = f(shape=(0, row_length))
    for i in range(100000 // row_length):
        x.update([i] * row_length)
    for i in range(1000):
        x.finalize()

for x in 'ABC':
    cProfile.run('test_class(%s)' % x)
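The nested-list versus flat-list conversion cost mentioned above can also be checked in isolation. This is a small sketch, assuming 20,000 rows of length 5; the sizes and variable names are arbitrary choices for illustration. It prints the timings rather than asserting which one wins, since that depends on the machine and the numpy version.

```python
import timeit

import numpy as np

# The same values stored two ways: as a list of rows and as one flat list
rows = [[i] * 5 for i in range(20000)]
flat = [v for row in rows for v in row]

# Both conversions should produce the same 2-D array
nested = np.array(rows, dtype=float)
reshaped = np.array(flat, dtype=float).reshape((-1, 5))
assert np.array_equal(nested, reshaped)

t_nested = timeit.timeit(lambda: np.array(rows, dtype=float), number=20)
t_flat = timeit.timeit(
    lambda: np.array(flat, dtype=float).reshape((-1, 5)), number=20)

print("list of lists    ", t_nested)
print("flat list+reshape", t_flat)
```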

And another option to add to the post above from Luca Fiaschi:

import time

# Reuses a, N and nruns from Luca Fiaschi's snippet
b = []
for i in range(nruns):
    s = time.time()
    c1 = np.array(a, dtype=int).reshape((N, 1000))
    b.append(time.time() - s)

print("Timing version array.reshape ", np.mean(b))

This gives for me:

Timing version vstack         0.6863266944885253
Timing version reshape        0.505419111251831
Timing version array.reshape  0.5052066326141358
Timing version concatenate    0.5339600563049316
