
Improving Python + numpy array allocation/initialization performance

I'm writing a Python program that uses some external functionality from a DLL. My problem is passing matrices (numpy arrays in Python) into and out of the C code. Currently I'm using the following code to receive data from the DLL:

peak_count = ct.c_int16()
peak_wl_array = np.zeros(512, dtype=np.double)
peak_pwr_array = np.zeros(512, dtype=np.double)

res = __dll.DLL_Search_Peaks(ct.c_int(data.shape[0]),
                             ct.c_void_p(data_array.ctypes.data),
                             ct.c_void_p(peak_wl_array.ctypes.data),
                             ct.c_void_p(peak_pwr_array.ctypes.data),
                             ct.byref(peak_count))
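Part of the per-call cost can also be the ctypes wrapping itself. One common option is to declare the function's `argtypes` once (e.g. with `np.ctypeslib.ndpointer`), so ctypes accepts numpy arrays directly without per-call `c_void_p()` construction. The signature below is an assumption mirroring the call above, since the real prototype of `DLL_Search_Peaks` isn't shown; a minimal sketch:

```python
import ctypes as ct
import numpy as np

# Hypothetical one-time setup: with argtypes declared, ctypes converts
# numpy arrays on each call, so no per-call c_void_p() wrapping is needed.
# The signature is an assumption based on the call in the question.
double_arr = np.ctypeslib.ndpointer(dtype=np.double, flags="C_CONTIGUOUS")
# __dll.DLL_Search_Peaks.argtypes = [ct.c_int, double_arr, double_arr,
#                                    double_arr, ct.POINTER(ct.c_int16)]
# res = __dll.DLL_Search_Peaks(data.shape[0], data_array,
#                              peak_wl_array, peak_pwr_array,
#                              ct.byref(peak_count))

# Equivalent writable pointer without the c_void_p() wrapper:
peak_wl_array = np.zeros(512, dtype=np.double)
p = peak_wl_array.ctypes.data_as(ct.POINTER(ct.c_double))
```

Here `p` points at the same buffer the DLL would write into, so no data is copied either way.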

It works like a charm, but my problem is numpy allocation speed: even without calling the DLL (the call commented out), 100,000 calls take 3.1 seconds.

That is just allocating with np.zeros() and taking a writable pointer with ctypes.c_void_p(D.ctypes.data).

I need to process about 20,000 calls per second, so almost all the time is spent just allocating memory.

I've thought about Cython, but it won't speed up numpy array allocation, so I'd gain nothing there.

Is there a faster way to receive matrix-like data from a C-written DLL?

Memory operations are expensive, numpy or otherwise.

If you're going to be allocating a lot of arrays, it's a good idea to see if you can do the allocation just once, and use views or subarrays to work with part of the array:

import numpy as np

niters=10000
asize=512

def forig():
    for i in xrange(niters):
        peak_wl_array = np.empty((asize), dtype=np.double)
        peak_pwr_array = np.empty((asize), dtype=np.double)

    return peak_pwr_array


def fviews():
    peak_wl_arrays  = np.empty((asize*niters), dtype=np.double)
    peak_pwr_arrays = np.empty((asize*niters), dtype=np.double)

    for i in xrange(niters):
        # create views
        peak_wl_array  = peak_wl_arrays[i*asize:(i+1)*asize]
        peak_pwr_array = peak_pwr_arrays[i*asize:(i+1)*asize]
        # then do something

    return peak_pwr_arrays


def fsubemptys():
    peak_wl_arrays  = np.empty((niters,asize), dtype=np.double)
    peak_pwr_arrays = np.empty((niters,asize), dtype=np.double)

    for i in xrange(niters):
        # do something with peak_wl_arrays[i,:]
        pass

    return peak_pwr_arrays


import timeit

print timeit.timeit(forig,number=100)
print timeit.timeit(fviews,number=100)
print timeit.timeit(fsubemptys,number=100)

Running this gives:

3.41996979713
0.844147920609
0.00169682502747
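The view approach also composes with the ctypes call from the question: each slice shares the parent array's buffer, so its `.ctypes.data` is just an offset into the single allocation and can be passed to the DLL with no new allocation per call. A small check of that (the 8 is `sizeof(double)`):

```python
import numpy as np

asize = 512
niters = 4
base = np.empty(niters * asize, dtype=np.double)

for i in range(niters):
    view = base[i * asize:(i + 1) * asize]
    # Each view's data pointer is an offset into the one allocation;
    # view.ctypes.data can be handed to the DLL exactly as before.
    offset = view.ctypes.data - base.ctypes.data
    assert offset == i * asize * 8   # 8 bytes per double
```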

Note that if you use (say) np.zeros instead, you're spending most of your time initializing memory, not allocating it, and that always takes substantially longer, erasing most of the difference between these approaches:

4.20200014114
5.43090081215
4.58127593994

Good single-threaded bandwidth to main memory on newer systems is something like ~10 GB/s (roughly 1 billion doubles/sec), so it's always going to take about

1024 doubles/call / (1 billion doubles/sec) ≈ 1 microsecond/call

to zero out the memory, which is already a significant chunk of the time you're seeing. Still, if you initialize a single large array before making the calls, the total execution time will be the same, but the latency of each call will be lower.
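If the DLL really does require zeroed output buffers (an assumption; it may simply overwrite them), a middle ground is to allocate once and zero in place with `fill(0)` before each call, paying only the initialization cost per call rather than allocation plus initialization. A minimal sketch:

```python
import numpy as np

# One-time allocation, reused across all calls.
peak_wl_array = np.empty(512, dtype=np.double)

def prepare_buffer(buf):
    # Zero in place: no new allocation per call, only the ~1 us
    # of memory bandwidth estimated above.
    buf.fill(0.0)
    return buf
```

Usage stays the same as in the question: call prepare_buffer(peak_wl_array), then pass peak_wl_array.ctypes.data to the DLL.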
