Python: rewriting a looping numpy math function to run on the GPU

Question

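Can someone help me rewrite this one function (the doTheMath function) so that the calculations run on the GPU? I have spent several days trying to get my head around it, with no result. Feel free to rewrite the function in whatever way you see fit, as long as it gives the same result at the end. I tried @jit from numba, but for some reason it is actually much slower than running the code as usual. With such a huge sample size, the goal is to cut the execution time considerably, so I believe the GPU is the fastest way to do it.

Let me explain what is actually happening. The real data, which looks almost identical to the sample data created in the code below, is divided into samples of approx. 5.000.000 rows each, or around 150MB per file. In total there are around 600.000.000 rows, or 20GB of data. I must loop through this data sample by sample, and then row by row within each sample, take the last 2000 (or some other number of) rows as of each row, and run the doTheMath function, which returns a result. That result is then saved back to the hard drive, where I can do other things with it in another program. As you can see below, I do not need the results for all rows, only those greater than a specific amount. If I run my function as it is right now in Python, it takes about 62 seconds per 1.000.000 rows. That is a very long time considering the amount of data and how fast this should be done.

I should mention that I load the real data into RAM file by file with data = joblib.load(file), so loading the data is not the problem: it takes only about 0.29 seconds per file. Once a file is loaded, I run the entire code below. What takes the longest is the doTheMath function. I am willing to give all 500 of my reputation points on stackoverflow as a reward to somebody willing to help me rewrite this simple code to run on the GPU. My interest is specifically in the GPU; I really want to see how it is done for this problem.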

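EDIT/UPDATE 1: Here is a link to a small sample of the real data: data_csv.zip. It has about 102000 rows of real data1 and 2000 rows each for real data2a and data2b. Use minimumLimit = 400 on the real sample data.

EDIT/UPDATE 2: For those following this post, here is a short summary of the answers below. So far we have 4 answers to the original problem. The one offered by @Divakar consists of tweaks to the original code. Of the two tweaks, only the first is actually applicable to this problem; the second is a good tweak but does not apply here. Of the other three answers, two are CPU-based solutions and one is a tensorflow-GPU attempt. The Tensorflow-GPU attempt by Paul Panzer looks promising, but when I actually run it on the GPU it is slower than the original, so the code still needs improvement.

The other two CPU-based solutions were submitted by @PaulPanzer (a pure numpy solution) and @MSeifert (a numba solution). Both give very good results and both process the data extremely fast compared to the original code. Of the two, the one submitted by Paul Panzer is faster. It processes about 1.000.000 rows in about 3 seconds. The only problem is with smaller batchSizes; this can be overcome by switching to the numba solution offered by MSeifert, or even to the original code after all the tweaks discussed below.

I am very happy with, and thankful to, @PaulPanzer and @MSeifert for the work they put into their answers. Still, since this is a question about a GPU-based solution, I am waiting to see whether anybody is willing to try a GPU version, to see how much faster the data can be processed on the GPU compared to the current CPU solutions. If no other answer outperforms @PaulPanzer's pure numpy solution, I will accept his answer as the right one and he gets the bounty :)

EDIT/UPDATE 3: @Divakar has posted a new answer with a solution for the GPU. After my tests on real data, the speed is not even comparable to the CPU solutions. The GPU processes about 5.000.000 rows in about 1.5 seconds. This is incredible :) I am very excited about the GPU solution and I thank @Divakar for posting it. I also thank @PaulPanzer and @MSeifert for their CPU solutions :) Now my research continues at an incredible speed thanks to the GPU :)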

import pandas as pd
import numpy as np
import time

def doTheMath(tmpData1, data2a, data2b):
    A = tmpData1[:, 0]
    B = tmpData1[:,1]
    C = tmpData1[:,2]
    D = tmpData1[:,3]
    Bmax = B.max()
    Cmin  = C.min()
    dif = (Bmax - Cmin)
    abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

#Declare variables
batchSize = 2000
sampleSize = 5000000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data 

#Create Random Sample Data
data1 = np.matrix(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #upper limit
data2b = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #lower limit
#approx. half of data2a will be smaller than data2b, but that is only in the sample data because it is randomly generated, NOT the real data. The real data2a is always higher than data2b.


#Loop through the data
t0 = time.time()
for rowNr in  range(data1.shape[0]):
    tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
    if(tmp_df.shape[0] == batchSize):
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result])
print('Runtime:', time.time() - t0)

#Save data results
resultArray = np.array(resultArray)
print(resultArray[:,1].sum())
resultArray = pd.DataFrame({'index':resultArray[:,0], 'result':resultArray[:,1]})
resultArray.to_csv("Result Array.csv", sep=';')

Specs of the PC I am working on:

GTX970(4gb) video card; 
i7-4790K CPU 4.00Ghz; 
16GB RAM;
a SSD drive 
running Windows 7;

As a side question, would a second video card in SLI help with this problem?

Answer 1

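Tweak #1

It is usually advised to vectorize things when working with NumPy arrays. But with very large arrays, I think you are out of options there. So, to boost performance, a minor tweak is possible on the last step, the summing.

We can replace the step that builds an array of 1s and 0s and then sums it: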

np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

with np.count_nonzero, which counts the True values in a boolean array efficiently instead of converting to 1s and 0s -

np.count_nonzero((abcd <= data2a) & (abcd >= data2b))

Runtime test -

In [45]: abcd = np.random.randint(11,99,(10000))

In [46]: data2a = np.random.randint(11,99,(10000))

In [47]: data2b = np.random.randint(11,99,(10000))

In [48]: %timeit np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
10000 loops, best of 3: 81.8 µs per loop

In [49]: %timeit np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
10000 loops, best of 3: 28.8 µs per loop
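As a quick sanity check, the two expressions should return the same count. Here is a minimal sketch with made-up random integers; the array names mirror the timing session above, while the seed and the use of default_rng are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, only for reproducibility
abcd = rng.integers(11, 99, 10000)
data2a = rng.integers(11, 99, 10000)    # upper limit
data2b = rng.integers(11, 99, 10000)    # lower limit

mask = (abcd <= data2a) & (abcd >= data2b)   # boolean array

count_where = np.where(mask, 1, 0).sum()     # builds an integer array, then sums it
count_nonzero = np.count_nonzero(mask)       # counts the True values directly

print(count_where == count_nonzero)
```

The boolean mask is computed either way; the saving comes from skipping the intermediate integer array.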

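Tweak #2

Use a pre-computed reciprocal when dealing with cases that undergo implicit broadcasting. More info here. So, store the reciprocal of dif and use that instead at the step: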

((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ...

Sample test -

In [52]: A = np.random.rand(10000)

In [53]: dif = 0.5

In [54]: %timeit A/dif
10000 loops, best of 3: 25.8 µs per loop

In [55]: %timeit A*(1.0/dif)
100000 loops, best of 3: 7.94 µs per loop

You have four places that divide by dif. So, hopefully this would bring a noticeable boost there too!
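For illustration, this is roughly how the reciprocal could be applied inside the function. It is only a sketch: do_the_math_reciprocal is a hypothetical name, it assumes plain 2-D arrays rather than np.matrix, and multiplying by a reciprocal can differ from a true division in the last bit.

```python
import numpy as np

def do_the_math_reciprocal(tmp, data2a, data2b):
    """Variant of doTheMath that multiplies by 1/dif instead of dividing four times."""
    A, B, C, D = tmp[:, 0], tmp[:, 1], tmp[:, 2], tmp[:, 3]
    Cmin = C.min()
    inv_dif = 1.0 / (B.max() - Cmin)     # one division per window
    abcd = ((A - Cmin) * inv_dif + (B - Cmin) * inv_dif
            + (C - Cmin) * inv_dif + (D - Cmin) * inv_dif) / 4
    return np.count_nonzero((abcd <= data2a) & (abcd >= data2b))

# Made-up sample data in the spirit of the question
rng = np.random.default_rng(0)
window = rng.uniform(1, 100, (2000, 4))
lower = rng.uniform(0, 1, 2000)
upper = lower + rng.uniform(0, 1, 2000)

result = do_the_math_reciprocal(window, upper, lower)
print(result)
```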

Answer 2

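Introduction and solution code

Well, you asked for it! So, listed in this post is an implementation with PyCUDA, which uses lightweight wrappers to expose most of CUDA's capabilities within the Python environment. We will use its SourceModule functionality, which lets us write and compile CUDA kernels while staying in Python.

Coming to the problem at hand: among the computations involved, we have a sliding maximum and minimum, a few differences and divisions, and comparisons. The maximum and minimum parts involve finding a block max (for each sliding window), for which we will use the reduction technique discussed in some detail here. This is done at block level. For the upper-level iterations across sliding windows, we use grid-level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions such as max and min, but then we lose control; specifically, we intend to use specialized memory such as shared and constant memory to leverage the GPU at close to its optimum level.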
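To make the block-level reduction easier to follow, here is a CPU emulation of the pattern in NumPy. It is a sketch only, not the CUDA code: tree_reduce_max is a hypothetical name, and it assumes tbp is a power of two and that the window has between tbp and 2*tbp elements, as with TBP = 1024 and batchSize = 2000.

```python
import numpy as np

def tree_reduce_max(window, tbp=1024):
    """Emulate a shared-memory max reduction: fold the tail, then halve repeatedly."""
    L = len(window)
    assert tbp <= L <= 2 * tbp, "each emulated thread handles at most two elements"
    shared = window[:tbp].copy()             # one slot per emulated thread

    # Threads with tid < L - tbp also fold in the element at tid + tbp
    tail = L - tbp
    shared[:tail] = np.maximum(shared[:tail], window[tbp:])

    # Pairwise halving: slot tid takes the max of itself and slot tid + inv
    inv = tbp // 2
    while inv != 0:
        shared[:inv] = np.maximum(shared[:inv], shared[inv:2 * inv])
        inv //= 2
    return shared[0]

rng = np.random.default_rng(0)
window = rng.uniform(1, 100, 2000).astype(np.float32)
print(tree_reduce_max(window) == window.max())
```

Each pass of the while loop corresponds to one synchronized step on the GPU, so a 1024-slot buffer is reduced in 10 steps after the initial fold.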

Listing out the PyCUDA-NumPy solution code -

1] PyCUDA part -

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK

__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
    int tid = threadIdx.x;
    int inv = TBP;
    __shared__ float dS[TBP][2];

    dS[tid][0] = d1[tid+offset];  
    dS[tid][1] = d2[tid+offset];         
    __syncthreads();

    if(tid<L-TBP)  
    {
        dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
        dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
    }
    __syncthreads();
    inv = inv/2;

    while(inv!=0)   
    {
        if(tid<inv)
        {
            dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
            dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
        }
        __syncthreads();
        inv = inv/2;
    }
    __syncthreads();

    if(tid==0)
    {
        out[0] = dS[0][0];
        out[1] = dS[0][1];
    }   
    __syncthreads();
}

__global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
{
    int L = BLOCKLEN[0];
    int tid = threadIdx.x;
    int iterID = blockIdx.x;
    float Bmax_Cmin[2];
    int inv;
    float Cmin, dif;   
    __shared__ float dS[TBP*2];   

    get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);  
    Cmin = Bmax_Cmin[1];
    dif = (Bmax_Cmin[0] - Cmin);

    inv = TBP;

    dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
    __syncthreads();

    if(tid<L-TBP)  
        dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);                   

     dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
     __syncthreads();

     if(tid<L-TBP)
         dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
     __syncthreads();

    inv = inv/2;
    while(inv!=0)   
    {
        if(tid<inv)
            dS[tid] += dS[tid+inv];
        __syncthreads();
        inv = inv/2;
    }

    if(tid==0)
        out[iterID] = dS[0];
    __syncthreads();

}
""")

Please note that THREADS_PER_BLOCK, TBP has to be set based on batchSize. The rule of thumb is to assign to TBP the power of 2 that is just below batchSize. Thus, for batchSize = 2000, we need TBP to be 1024.
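The rule of thumb can be written as a small helper. This is a sketch with a hypothetical name (threads_per_block); the kernel above hardcodes TBP in its #define, so the value would still have to be substituted into the kernel source by hand.

```python
def threads_per_block(batch_size):
    """Largest power of two strictly below batch_size (hypothetical helper)."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    # (batch_size - 1).bit_length() - 1 is the exponent of the highest
    # power of two that is <= batch_size - 1, i.e. strictly below batch_size
    return 1 << ((batch_size - 1).bit_length() - 1)

for size in (2000, 1500, 1024, 600):
    print(size, threads_per_block(size))
```

Reading the kernel, each thread handles the elements at tid and tid + TBP, so it appears to need TBP <= batchSize <= 2*TBP, which this rule satisfies. As far as I know, CUDA also limits a block to 1024 threads on current hardware, so the kernel as written would top out at batchSize = 2048.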

2] NumPy part -

def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
    func1 = mod.get_function("main1")
    outlen = len(A)-batchSize+1

    # Set block and grid sizes
    BSZ = (1024,1,1)
    GSZ = (outlen,1)

    dest = np.zeros(outlen).astype(np.float32)
    N = np.int32(batchSize)
    func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                     drv.In(data2b), drv.In(data2a),\
                     drv.In(N), block=BSZ, grid=GSZ)
    idx = np.flatnonzero(dest >= minimumLimit)
    return idx, dest[idx]

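Benchmarking

I tested on a GTX 960M. Please note that PyCUDA expects arrays to be contiguous. So, we need to slice out the columns and make copies. I am expecting/assuming that the data could be read from the files such that it is laid out along rows instead of as columns. Thus, I am keeping those copies out of the benchmarked function for now.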
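A short sketch of the contiguity point, using a small made-up array. The transposed layout at the end is my own suggestion for keeping each column contiguous, not something the answer prescribes.

```python
import numpy as np

data1 = np.random.uniform(1, 100, (10, 4)).astype(np.float32)

col_view = data1[:, 1]                        # strided view into the row-major array
col_copy = np.ascontiguousarray(col_view)     # contiguous copy, same values

print(col_view.flags['C_CONTIGUOUS'])         # False
print(col_copy.flags['C_CONTIGUOUS'])         # True

# Alternative: store the data column-major once, so each column is a contiguous row
cols = np.ascontiguousarray(data1.T)          # shape (4, N)
print(cols[1].flags['C_CONTIGUOUS'])          # True
```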

Original approach -

def org_app(data1, batchSize, minimumLimit):
    resultArray = []
    for rowNr in  range(data1.shape[0]-batchSize+1):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result]) 
    return resultArray

Timings and verification -

In [2]: #Declare variables
   ...: batchSize = 2000
   ...: sampleSize = 50000
   ...: resultArray = []
   ...: minimumLimit = 490 #use 400 on the real sample data
   ...: 
   ...: #Create Random Sample Data
   ...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
   ...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: 
   ...: # Make column copies
   ...: A = data1[:,0].copy()
   ...: B = data1[:,1].copy()
   ...: C = data1[:,2].copy()
   ...: D = data1[:,3].copy()
   ...: 
   ...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
   ...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
   ...: print(np.allclose(gpu_out1, cpu_out1))
   ...: print(np.allclose(gpu_out2, cpu_out2))
   ...: 
True
False

So, there are some differences between the CPU and GPU counts. Let's investigate them -

In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))

In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])

In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1.,  1.,  1.])

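There are four instances of non-matching counts. These are off by at most 1. Upon research, I came across some information on this. Basically, we are using math intrinsics for the max and min computations, and I think those cause the last binary bit of the floating-point representation to differ from the CPU counterpart. This is termed ULP error and has been discussed in detail here and here.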
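Whatever the exact source of the last-bit difference, the effect on a count is easy to reproduce. The sketch below uses made-up values (abcd_cpu, abcd_gpu and limit are hypothetical stand-ins) to show how a one-ULP difference flips a comparison when a value sits right on a limit:

```python
import numpy as np

limit = np.float32(0.5)                              # stands in for one entry of data2a
abcd_cpu = np.float32(0.5)                           # value exactly on the limit
abcd_gpu = np.nextafter(abcd_cpu, np.float32(1.0))   # the same value, one ULP higher

print(abcd_gpu - abcd_cpu)    # size of one ULP at 0.5 in float32
print(abcd_cpu <= limit)      # True: this element is counted
print(abcd_gpu <= limit)      # False: this element is not counted
```

One such element per window is enough to move the window's count by 1, which is consistent with the differences of -1 and +1 seen above.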

Finally, putting that issue aside, let's get to the most important bit, the performance -

In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop

In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop

In [12]: 2180.0/82.5
Out[12]: 26.424242424242426

Let's try a bigger dataset. With sampleSize = 500000, we get -

In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop

In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop

In [16]: 23200.0/821
Out[16]: 28.25822168087698

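Thus, the speedup stays roughly constant at around 27.

Limitations:

1) We are using float32 numbers, as GPUs work best with those. Double precision, especially on non-server GPUs, is not popular when it comes to performance, and since you are working with such a GPU, I tested with float32.

Further improvement:

1) We could use the faster constant memory to feed in data2a and data2b, rather than global memory.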

Answer 3

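Before you start changing the target (GPU) or using anything else (i.e. parallel execution), you might want to consider how to improve the existing code. You used the numba tag, so I will use it to improve the code. First, we operate on arrays, not on matrices: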

data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit

Each time you call doTheMath you want an integer back, yet you work with a lot of arrays and create a lot of intermediate arrays:

abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

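This creates an intermediate array at each step:

tmp1 = A - Cmin,
tmp2 = tmp1 / dif,
tmp3 = B - Cmin,
tmp4 = tmp3 / dif
... you get the gist.

However, this is a reduce function (array -> integer), so having a lot of intermediate arrays is unnecessary weight; just calculate the value on the fly.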

import numba as nb

@nb.njit
def doTheMathNumba(tmpData, data2a, data2b):
    Bmax = np.max(tmpData[:, 1])
    Cmin = np.min(tmpData[:, 2])
    diff = (Bmax - Cmin)
    idiff = 1 / diff
    sum_ = 0
    for i in range(tmpData.shape[0]):
        val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
        if val <= data2a[i] and val >= data2b[i]:
            sum_ += 1
    return sum_

I did something else here to avoid repeated operations:

(((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
= ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
= (A + B + C + D - 4 * Cmin) / (4 * dif)
= (A + B + C + D) / (4 * dif) - (Cmin / dif)

This actually cuts the execution time by almost a factor of 10 on my computer:

%timeit doTheMath(tmp_df, data2a, data2b)       # 1000 loops, best of 3: 446 µs per loop
%timeit doTheMathNumba(tmp_df, data2a, data2b)  # 10000 loops, best of 3: 59 µs per loop

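There are certainly other improvements as well, for example using a rolling min/max to calculate Bmax and Cmin, which would make at least part of the calculation run in O(sampleSize) instead of O(sampleSize * batchSize). This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations, because if Cmin and Bmax do not change for the next sample, these values do not differ. It is a bit complicated to do because the comparisons differ. But it is definitely possible! See here: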

import time
import numpy as np
import numba as nb

@nb.njit
def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
    diff = (Bmax - Cmin)
    idiff = 1 / diff
    quarter_idiff = 0.25 * idiff
    sum_ = 0
    for i in range(abcd.shape[0]):
        val = abcd[i] * quarter_idiff - Cmin * idiff
        if val <= data2a[i] and val >= data2b[i]:
            sum_ += 1
    return sum_

@nb.njit
def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
    found = 0
    for rowNr in range(data1.shape[0]):
        if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
            result = doTheMathNumba(abcd[rowNr:rowNr + batchSize], 
                                    data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
            if (result >= minimumLimit):
                resultArray[found, 0] = rowNr
                resultArray[found, 1] = result
                found += 1
    return resultArray[:found]

#Declare variables
batchSize = 2000
sampleSize = 50000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data 

data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit

from scipy import ndimage
t0 = time.time()
abcd = np.sum(data1, axis=1)
Bmaxs = ndimage.maximum_filter1d(data1[:, 1], 
                                 size=batchSize, 
                                 origin=-((batchSize-1)//2-1))  # correction for even shapes
Cmins = ndimage.minimum_filter1d(data1[:, 2], 
                                 size=batchSize, 
                                 origin=-((batchSize-1)//2-1))

result = np.zeros((sampleSize, 2), dtype=np.int64)
doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
print('Runtime:', time.time() - t0)

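This gives me a Runtime: 0.759593152999878 (after numba has compiled the functions!), while your original took Runtime: 24.68975639343262. Now we are 30 times faster!

With your sample size it still takes Runtime: 60.187848806381226, but that's not too bad, right?

Even though I haven't done this myself, numba says it is possible to write "Numba for CUDA GPUs", and it does not seem too complicated.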

Answer 4

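Below is some code to demonstrate what is possible just by tweaking the algorithm. It is pure numpy, but on the sample data you posted it gives roughly a 35x speedup over the original version (about 250,000 samples in ~2.5 sec on my rather modest machine):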

>>> result_dict = master('run')
[('load', 0.82578349113464355), ('precomp', 0.028138399124145508), ('max/min', 0.24333405494689941), ('ABCD', 0.015314102172851562), ('main', 1.3356468677520752)]
TOTAL 2.44821691513

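Tweaks used:

A + B + C + D, see my other answer
running min/max, including avoiding computing (A + B + C + D - 4Cmin) / (4dif) multiple times with the same Cmin/dif.

These are more or less routine. That leaves the comparison with data2a/b, which is an expensive O(NK), where N is the number of samples and K is the size of the window. Here one can take advantage of the relatively well-behaved data. Using the running min/max, one can create variants of data2a/b that can be used to test a whole range of window offsets at a time: if the test fails, all of these offsets can be ruled out immediately; otherwise the range is bisected.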
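The pre-screening idea can be sketched in a few lines. This is a simplified brute-force illustration, not the implementation used in this answer: envelope is a hypothetical helper, and the sizes and data are made up. For a range of n_offs window offsets, the upper limits are replaced by their pointwise maximum over all shifts (and the lower limits by their minimum), so the number of points inside this envelope can never be smaller than the exact count of any single window in the range.

```python
import numpy as np

def envelope(bound, n_offs, reducer):
    """Reduce `bound` over n_offs consecutive shifts (np.max or np.min).

    Position j of the result covers every bound[j - o] that a window
    starting at offset o (0 <= o < n_offs) would compare against data[j].
    """
    K = len(bound)
    out = np.empty(K + n_offs - 1)
    for j in range(K + n_offs - 1):
        lo = max(0, j - n_offs + 1)      # smallest valid index i = j - o
        hi = min(K - 1, j)               # largest valid index i
        out[j] = reducer(bound[lo:hi + 1])
    return out

rng = np.random.default_rng(0)
K, n_offs = 50, 8
lower = rng.uniform(0, 1, K)
upper = lower + rng.uniform(0, 1, K)
block = rng.uniform(0, 2, K + n_offs - 1)   # data covering all n_offs windows

up_env = envelope(upper, n_offs, np.max)
lo_env = envelope(lower, n_offs, np.min)
prescreen = np.count_nonzero((block <= up_env) & (block >= lo_env))

exact = [np.count_nonzero((block[o:o + K] <= upper) & (block[o:o + K] >= lower))
         for o in range(n_offs)]
print(prescreen, max(exact))   # prescreen is an upper bound on every exact count
```

If prescreen falls below the threshold, none of the n_offs windows can reach it, so the whole range is rejected with a single comparison pass. The answer's full implementation follows.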

import numpy as np
import time

# global variables; they will hold the precomputed pre-screening filters
preA, preB = {}, {}
CHUNK_SIZES = None

def sliding_argmax(data, K=2000):
    """compute the argmax of data over a sliding window of width K

    returns:
        indices  -- indices into data
        switches -- window offsets at which the maximum changes
                    (strictly speaking: where the index of the maximum changes)
                    excludes 0 but includes maximum offset (len(data)-K+1)

    see last line of compute_pre_screening_filter for a recipe to convert
    this representation to the vector of maxima
    """
    N = len(data)
    last = np.argmax(data[:K])
    indices = [last]
    while indices[-1] <= N - 1:
        ge = np.where(data[last + 1 : last + K + 1] > data[last])[0]
        if len(ge) == 0:
            if last + K >= N:
                break
            last += 1 + np.argmax(data[last + 1 : last + K + 1])
            indices.append(last)
        else:
            last += 1 + ge[0]
            indices.append(last)
    indices = np.array(indices)
    switches = np.where(data[indices[1:]] > data[indices[:-1]],
                        indices[1:] + (1-K), indices[:-1] + 1)
    return indices, np.r_[switches, [len(data)-K+1]]


def compute_pre_screening_filter(bound, n_offs):
    """compute pre-screening filter for point-wise upper bound

    given a K-vector of upper bounds B and K+n_offs-1-vector data
    compute K+n_offs-1-vector filter such that for each index j
    if for any offset 0 <= o < n_offs and index 0 <= i < K such that
    o + i = j, the inequality B_i >= data_j holds then filter_j >= data_j

    therefore the number of data points below filter is an upper bound for
    the maximum number of points below bound in any K-window in data
    """
    pad_l, pad_r = np.min(bound[:n_offs-1]), np.min(bound[1-n_offs:])
    padded = np.r_[pad_l+np.zeros(n_offs-1,), bound, pad_r+np.zeros(n_offs-1,)]
    indices, switches = sliding_argmax(padded, n_offs)
    return padded[indices].repeat(np.diff(np.r_[[0], switches]))


def compute_all_pre_screening_filters(upper, lower, min_chnk=5, dyads=6):
    """compute upper and lower pre-screening filters for data blocks of
    sizes K+n_offs-1 where
    n_offs = min_chnk, 2min_chnk, ..., 2^(dyads-1)min_chnk

    the result is stored in global variables preA and preB
    """
    global CHUNK_SIZES

    CHUNK_SIZES = min_chnk * 2**np.arange(dyads)
    preA[1] = upper
    preB[1] = lower
    for n in CHUNK_SIZES:
        preA[n] = compute_pre_screening_filter(upper, n)
        preB[n] = -compute_pre_screening_filter(-lower, n)


def test_bounds(block, counts, threshold=400):
    """test whether the windows fitting in the data block 'block' fall
    within the bounds using pre-screening for efficient bulk rejection

    array 'counts' will be overwritten with the counts of compliant samples
    note that accurate counts will only be returned for above threshold
    windows, because the analysis of bulk rejected windows is short-circuited

    also note that bulk rejection only works for 'well behaved' data and
    for example not on random numbers
    """
    N = len(counts)
    K = len(preA[1])
    r = N % CHUNK_SIZES[0]
    # chop up N into as large as possible chunks with matching pre computed
    # filters
    # start with small and work upwards
    counts[:r] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                   (block[l:l+K] >= preB[1]))
                  for l in range(r)]

    def bisect(block, counts):
        M = len(counts)
        cnts = np.count_nonzero((block <= preA[M]) & (block >= preB[M]))
        if cnts < threshold:
            counts[:] = cnts
            return
        elif M == CHUNK_SIZES[0]:
            counts[:] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                          (block[l:l+K] >= preB[1]))
                         for l in range(M)]
        else:
            M //= 2
            bisect(block[:-M], counts[:M])
            bisect(block[M:], counts[M:])

    N = N // CHUNK_SIZES[0]
    for M in CHUNK_SIZES:
        if N % 2:
            bisect(block[r:r+M+K-1], counts[r:r+M])
            r += M
        elif N == 0:
            return
        N //= 2
    else:
        for j in range(2*N):
            bisect(block[r:r+M+K-1], counts[r:r+M])
            r += M


def analyse(data, use_pre_screening=True, min_chnk=5, dyads=6,
            threshold=400):
    samples, upper, lower = data
    N, K = samples.shape[0], upper.shape[0]
    times = [time.time()]
    if use_pre_screening:
        compute_all_pre_screening_filters(upper, lower, min_chnk, dyads)
    times.append(time.time())
    # compute switching points of max and min for running normalisation
    upper_inds, upper_swp = sliding_argmax(samples[:, 1], K)
    lower_inds, lower_swp = sliding_argmax(-samples[:, 2], K)
    times.append(time.time())
    # sum columns
    ABCD = samples.sum(axis=-1)
    times.append(time.time())
    counts = np.empty((N-K+1,), dtype=int)
    # main loop
    # loop variables:
    offs = 0
    u_ind, u_scale, u_swp = 0, samples[upper_inds[0], 1], upper_swp[0]
    l_ind, l_scale, l_swp = 0, samples[lower_inds[0], 2], lower_swp[0]
    while True:
        # check which is switching next, min(C) or max(B)
        if u_swp > l_swp:
            # greedily take the largest block possible such that dif and Cmin
            # do not change
            block = (ABCD[offs:l_swp+K-1] - 4*l_scale) \
                    * (0.25 / (u_scale-l_scale))
            if use_pre_screening:
                test_bounds(block, counts[offs:l_swp], threshold=threshold)
            else:
                counts[offs:l_swp] = [
                    np.count_nonzero((block[l:l+K] <= upper) &
                                     (block[l:l+K] >= lower))
                    for l in range(l_swp - offs)]
            # book keeping
            l_ind += 1
            offs = l_swp
            l_swp = lower_swp[l_ind]
            l_scale = samples[lower_inds[l_ind], 2]
        else:
            block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \
                    * (0.25 / (u_scale-l_scale))
            if use_pre_screening:
                test_bounds(block, counts[offs:u_swp], threshold=threshold)
            else:
                counts[offs:u_swp] = [
                    np.count_nonzero((block[l:l+K] <= upper) &
                                     (block[l:l+K] >= lower))
                    for l in range(u_swp - offs)]
            u_ind += 1
            if u_ind == len(upper_inds):
                assert u_swp == N-K+1
                break
            offs = u_swp
            u_swp = upper_swp[u_ind]
            u_scale = samples[upper_inds[u_ind], 1]
    times.append(time.time())
    return {'counts': counts, 'valid': np.where(counts >= threshold)[0],
            'timings': np.diff(times)}


def master(mode='calibrate', data='fake', use_pre_screening=True, nrep=3,
           min_chnk=None, dyads=None):
    t = time.time()
    if data in ('fake', 'load'):
        data1 = np.loadtxt('data1.csv', delimiter=';', skiprows=1,
                           usecols=[1,2,3,4])
        data2a = np.loadtxt('data2a.csv', delimiter=';', skiprows=1,
                            usecols=[1])
        data2b = np.loadtxt('data2b.csv', delimiter=';', skiprows=1,
                            usecols=[1])
        if data == 'fake':
            data1 = np.tile(data1, (10, 1))
        threshold = 400
    elif data == 'random':
        data1 = np.random.random((102000, 4))
        data2b = np.random.random(2000)
        data2a = np.random.random(2000)
        threshold = 490
        if use_pre_screening or mode == 'calibrate':
            print('WARNING: pre-screening not efficient on artificial data')
    else:
        raise ValueError("data mode {} not recognised".format(data))
    data = data1, data2a, data2b
    t_load = time.time() - t
    if mode == 'calibrate':
        min_chnk = (2, 3, 4, 5, 6) if min_chnk is None else min_chnk
        dyads = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) if dyads is None else dyads
        timings = np.zeros((len(min_chnk), len(dyads)))
        print('max bisect  ' + ' '.join([
            '   n.a.' if dy == 0 else '{:7d}'.format(dy) for dy in dyads]),
              end='')
        for i, mc in enumerate(min_chnk):
            print('\nmin chunk {}'.format(mc), end=' ')
            for j, dy in enumerate(dyads):
                for k in range(nrep):
                    if dy == 0: # no pre-screening
                        timings[i, j] += analyse(
                            data, False, mc, dy, threshold)['timings'][3]
                    else:
                        timings[i, j] += analyse(
                            data, True, mc, dy, threshold)['timings'][3]
                timings[i, j] /= nrep
                print('{:7.3f}'.format(timings[i, j]), end=' ', flush=True)
        best_mc, best_dy = np.unravel_index(np.argmin(timings.ravel()),
                                            timings.shape)
        print('\nbest', min_chnk[best_mc], dyads[best_dy])
        return timings, min_chnk[best_mc], dyads[best_dy]
    if mode == 'run':
        min_chnk = 2 if min_chnk is None else min_chnk
        dyads = 5 if dyads is None else dyads
        res = analyse(data, use_pre_screening, min_chnk, dyads, threshold)
        times = np.r_[[t_load], res['timings']]
        print(list(zip(('load', 'precomp', 'max/min', 'ABCD', 'main'), times)))
        print('TOTAL', times.sum())
        return res

Answer 5

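~~Technically this is off-topic (not GPU), but I am sure you will be interested.~~

There is one obvious and rather large saving:

Precompute A + B + C + D (not in the loop, but on the whole data: data1.sum(axis=-1)), because abcd = ((A+B+C+D) - 4Cmin) / (4dif). This will save quite a few operations.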
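A minimal sketch of this saving, with a hypothetical function name and small made-up data. Only the column sum is hoisted out of the loop here; the min and max are still recomputed for every window.

```python
import numpy as np

def counts_with_precomputed_sum(data1, data2a, data2b):
    """Count in-bounds points per rolling window, summing the columns only once."""
    K = len(data2a)
    abcd_sum = data1.sum(axis=-1)            # A + B + C + D for every row, done once
    counts = np.empty(len(data1) - K + 1, dtype=int)
    for i in range(len(counts)):
        Cmin = data1[i:i + K, 2].min()
        dif = data1[i:i + K, 1].max() - Cmin
        abcd = (abcd_sum[i:i + K] - 4 * Cmin) / (4 * dif)
        counts[i] = np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
    return counts

rng = np.random.default_rng(0)
data1 = rng.uniform(1, 100, (60, 4))
data2b = rng.uniform(0, 1, 20)               # lower limit
data2a = data2b + rng.uniform(0, 1, 20)      # upper limit
counts = counts_with_precomputed_sum(data1, data2a, data2b)
print(counts)
```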

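Surprised nobody spotted that one ;-)

Edit:

There is one more thing, though I suspect it applies only to your example and not to your real data:

As it stands, roughly half of data2a will be smaller than data2b. In those places your two conditions on abcd cannot both be True, so you need not even compute abcd there.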
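A sketch of that shortcut on made-up data: the positions where the upper limit is at least the lower limit are found once and reused for every window. The name feasible is my own, and abcd here is just a random stand-in for one window's normalised values.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2000
data2a = rng.uniform(0, 1, K)    # upper limit
data2b = rng.uniform(0, 1, K)    # lower limit (random here, so often above data2a)
abcd = rng.uniform(0, 1, K)      # stand-in for one window's normalised values

# Positions where both conditions can hold at all; computed once for all windows
feasible = np.flatnonzero(data2a >= data2b)

full = np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
reduced = np.count_nonzero((abcd[feasible] <= data2a[feasible]) &
                           (abcd[feasible] >= data2b[feasible]))
print(len(feasible), full == reduced)
```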

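Edit:

One more tweak that I used below but forgot to mention: computing the max (or min) over a moving window. When you move one point to the right, say, how likely is the max to change? Only two things can change it: the new point on the right is larger (this happens roughly once per window length, and when it does, you know the new max immediately), or the old max falls off the window on the left (this also happens roughly once per window length). Only in this last case do you have to search the entire window for the new max.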
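The moving-window logic just described could look like this in plain Python/NumPy. It is a sketch with a hypothetical function name, checked against a brute-force scan on made-up data:

```python
import numpy as np

def sliding_max(data, K):
    """Maximum of every length-K window, rescanning only when the old max leaves."""
    n_win = len(data) - K + 1
    out = np.empty(n_win, dtype=data.dtype)
    imax = int(np.argmax(data[:K]))           # index of the current maximum
    out[0] = data[imax]
    for i in range(1, n_win):
        new = i + K - 1                       # index entering on the right
        if imax < i:                          # old max fell off on the left
            imax = i + int(np.argmax(data[i:i + K]))   # full rescan (rare)
        elif data[new] >= data[imax]:         # newcomer is at least as large
            imax = new
        out[i] = data[imax]
    return out

rng = np.random.default_rng(0)
data = rng.uniform(1, 100, 500)
K = 37
brute = np.array([data[i:i + K].max() for i in range(len(data) - K + 1)])
print(np.array_equal(sliding_max(data, K), brute))
```

The sliding minimum follows by applying the same function to -data and negating the result.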

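Edit:

I could not resist giving it a try in tensorflow. I don't have a GPU, so you will have to test the speed yourself. Replace "cpu" with "gpu" on the marked line.

On the cpu it runs at about half the speed of your original implementation (i.e. without Divakar's tweaks). Note that I have taken the liberty of changing the inputs from matrices to plain arrays. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python3.6 and tf 0.12.1. If you do a pip3 install tensorflow-gpu today, it might work.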

import numpy as np
import time
import tensorflow as tf

# currently the max/min code is sequential
# thus
parallel_iterations = 1
# but you can put this in a separate loop, precompute and then try and run
# the remainder of doTheMathTF with a larger parallel_iterations

# tensorflow is quite capricious about its data types
ddf = tf.float64
ddi = tf.int32

def worker(data1, data2a, data2b):
    ###################################
    # CHANGE cpu to gpu in next line! #
    ###################################
    with tf.device('/cpu:0'):
        g = tf.Graph ()
        with g.as_default():
            ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
            B = tf.constant(data1[:, 1], dtype=ddf)
            C = tf.constant(data1[:, 2], dtype=ddf)
            window = tf.constant(len(data2a))
            N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
            data2a = tf.constant(data2a, dtype=ddf)
            data2b = tf.constant(data2b, dtype=ddf)
            def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
                # most of the time we can keep the old max/min
                Bmaxind = tf.cond(Bmaxind<i,
                                  lambda: i + tf.to_int32(
                                      tf.argmax(B[i:i+window], axis=0)),
                                  lambda: tf.cond(Bmax>B[i+window-1], 
                                                  lambda: Bmaxind, 
                                                  lambda: i+window-1))
                Cminind = tf.cond(Cminind<i,
                                  lambda: i + tf.to_int32(
                                      tf.argmin(C[i:i+window], axis=0)),
                                  lambda: tf.cond(Cmin<C[i+window-1],
                                                  lambda: Cminind,
                                                  lambda: i+window-1))
                Bmax = B[Bmaxind]
                Cmin = C[Cminind]
                abcd = (ABCD[i:i+window] - 4 * Cmin) * (1 / (4 * (Bmax-Cmin)))
                out = out.write(i, tf.to_int32(
                    tf.count_nonzero(tf.logical_and(abcd <= data2a,
                                                    abcd >= data2b))))
                return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
            with tf.Session(graph=g) as sess:
                i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
                    lambda i, _1, _2, _3, _4, _5: i<N, doTheMathTF,
                    (tf.Variable(0, dtype=ddi), tf.Variable(0.0, dtype=ddf),
                     tf.Variable(-1, dtype=ddi),
                     tf.Variable(0.0, dtype=ddf), tf.Variable(-1, dtype=ddi),
                     tf.TensorArray(ddi, size=N)),
                    shape_invariants=None,
                    parallel_iterations=parallel_iterations,
                    back_prop=False)
                out = out.pack()
                sess.run(tf.initialize_all_variables())
                out, = sess.run((out,))
    return out

#Declare variables
batchSize = 2000
sampleSize = 50000#00
resultArray = []

#Create Sample Data
data1 = np.random.uniform(1, 100, (sampleSize + batchSize, 4))
data2a = np.random.uniform(0, 1, (batchSize,))
data2b = np.random.uniform(0, 1, (batchSize,))

t0 = time.time()
out = worker(data1, data2a, data2b)
print('Runtime (tensorflow):', time.time() - t0)


good_indices, = np.where(out >= 490)
res_tf = np.c_[good_indices, out[good_indices]]

def doTheMath(tmpData1, data2a, data2b):
    A = tmpData1[:, 0]
    B  = tmpData1[:,1]
    C   = tmpData1[:,2]
    D = tmpData1[:,3]
    Bmax = B.max()
    Cmin  = C.min()
    dif = (Bmax - Cmin)
    abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

#Loop through the data
t0 = time.time()
for rowNr in  range(sampleSize+1):
    tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
    result = doTheMath(tmp_df, data2a, data2b)
    if (result >= 490):
        resultArray.append([rowNr , result])
print('Runtime (original):', time.time() - t0)
print(np.alltrue(np.array(resultArray)==res_tf))


Answer 1
10 2017-01-31 14:13:26

Answer 2
8 accepted 2017-02-07 00:08:41

Answer 3
5 2017-02-03 19:43:17

Answer 4
4 2017-02-03 20:37:38

Answer 5
3 2017-02-03 03:27:18
