Cython Gibbs sampler is slightly slower than the NumPy one

Question

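I have implemented a Gibbs sampler to generate textured images. Depending on the beta parameters (an array of shape (4,)), we can generate various textures.

Here is my initial function using Numpy: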

import copy
import functools

import numpy as np


def gibbs_sampler(img_label, betas, burnin, nb_samples):
    nb_iter = burnin + nb_samples

    lst_samples = []

    labels = np.unique(img_label)

    M, N = img_label.shape
    img_flat = img_label.flatten()

    # build neighborhood array by means of numpy broadcasting:
    m, n = np.ogrid[0:M, 0:N]

    top_left, top, top_right =   m[0:-2, :]*N + n[:, 0:-2], m[0:-2, :]*N + n[:, 1:-1]  , m[0:-2, :]*N + n[:, 2:]
    left, pix, right = m[1:-1, :]*N + n[:, 0:-2],  m[1:-1, :]*N + n[:, 1:-1], m[1:-1, :]*N + n[:, 2:]
    bottom_left, bottom, bottom_right = m[2:, :]*N + n[:, 0:-2],  m[2:, :]*N + n[:, 1:-1], m[2:, :]*N + n[:, 2:]

    mat_neigh = np.dstack([pix, top, bottom, left, right, top_left, bottom_right, bottom_left, top_right])

    mat_neigh = mat_neigh.reshape((-1, 9))    
    ind = np.arange((M-2)*(N-2))  

    # loop over iterations
    for iteration in np.arange(nb_iter):

        np.random.shuffle(ind)

        # loop over pixels
        for i in ind:                  

            truc = map(functools.partial(lambda label, img_flat, mat_neigh : 1-np.equal(label, img_flat[mat_neigh[i, 1:]]).astype(np.uint), img_flat=img_flat, mat_neigh=mat_neigh), labels)
            # bidule is of shape (4, 2, labels.size)
            bidule = np.array(truc).T.reshape((-1, 2, labels.size))

            # theta is of shape (labels.size, 4) 
            theta = np.sum(bidule, axis=1).T
            # prior is thus an array of shape (labels.size)
            prior = np.exp(-np.dot(theta, betas))

            # sample from the posterior
            drawn_label = np.random.choice(labels, p=prior/np.sum(prior))

            img_flat[(i//(N-2) + 1)*N + i%(N-2) + 1] = drawn_label


        if iteration >= burnin:
            print('Iteration %i --> sample' % iteration)
            lst_samples.append(copy.copy(img_flat.reshape(M, N)))

        else:
            print('Iteration %i --> burnin' % iteration)

    return lst_samples
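
For readers following the index arithmetic, the sketch below rebuilds the neighbourhood table for a tiny 4x5 image and prints its first row. The helper name build_neighborhood and the image size are illustrative assumptions, not part of the original code; the column order is the one used for mat_neigh above.

```python
import numpy as np


def build_neighborhood(M, N):
    """Return a ((M-2)*(N-2), 9) table of flat indices (hypothetical helper).

    Column order follows mat_neigh in the question: pix, top, bottom, left,
    right, top_left, bottom_right, bottom_left, top_right.
    """
    m, n = np.ogrid[0:M, 0:N]
    rows = {"top": m[0:-2, :], "mid": m[1:-1, :], "bottom": m[2:, :]}
    cols = {"left": n[:, 0:-2], "mid": n[:, 1:-1], "right": n[:, 2:]}

    def flat(r, c):
        # Broadcasting (M-2, 1) against (1, N-2) gives an (M-2, N-2) grid.
        return rows[r] * N + cols[c]

    stacked = np.dstack([
        flat("mid", "mid"), flat("top", "mid"), flat("bottom", "mid"),
        flat("mid", "left"), flat("mid", "right"),
        flat("top", "left"), flat("bottom", "right"),
        flat("bottom", "left"), flat("top", "right"),
    ])
    return stacked.reshape((-1, 9))


mat_neigh = build_neighborhood(4, 5)
print(mat_neigh.shape)        # (6, 9)
print(mat_neigh[0].tolist())  # [6, 1, 11, 5, 7, 0, 12, 10, 2]
```

Row i of the table belongs to the interior pixel whose flat index is (i//(N-2) + 1)*N + i%(N-2) + 1, which is the expression used in the update step of the sampler.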

We cannot get rid of any of the loops, since this is an iterative algorithm. I therefore tried to speed it up with Cython (using static typing):

from __future__ import division
import numpy as np
import copy
cimport numpy as np
import functools
cimport cython

INTTYPE = np.int_  # the np.int alias was removed in NumPy 1.24
DOUBLETYPE = np.double

ctypedef np.int_t INTTYPE_t
ctypedef  np.double_t DOUBLETYPE_t

def func_for_map(label, img_flat,  mat_neigh, i):

   return  (1-np.equal(label, img_flat[mat_neigh[i, 1:]])).astype(INTTYPE)


# the decorators must sit directly on the function they are meant to affect
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def gibbs_sampler(np.ndarray[INTTYPE_t, ndim=2] img_label, np.ndarray[DOUBLETYPE_t, ndim=1] betas, INTTYPE_t burnin=5, INTTYPE_t nb_samples=1):


    assert img_label.dtype == INTTYPE and betas.dtype== DOUBLETYPE

    cdef unsigned int nb_iter = burnin + nb_samples 

    lst_samples = list()

    cdef np.ndarray[INTTYPE_t, ndim=1] labels
    labels = np.unique(img_label)

    cdef unsigned int M, N
    M = img_label.shape[0]
    N = img_label.shape[1]

    cdef np.ndarray[INTTYPE_t, ndim=1] ind     
    ind = np.arange((M-2)*(N-2), dtype=INTTYPE)

    cdef np.ndarray[INTTYPE_t, ndim=1] img_flat
    img_flat = img_label.flatten()


    # build neighborhood array:
    cdef np.ndarray[INTTYPE_t, ndim=2] m
    cdef np.ndarray[INTTYPE_t, ndim=2] n


    m = (np.ogrid[0:M, 0:N][0]).astype(INTTYPE)
    n = (np.ogrid[0:M, 0:N][1]).astype(INTTYPE)



    cdef np.ndarray[INTTYPE_t, ndim=2] top_left, top, top_right, left, pix, right, bottom_left, bottom, bottom_right

    top_left, top, top_right =   m[0:-2, :]*N + n[:, 0:-2], m[0:-2, :]*N + n[:, 1:-1]  , m[0:-2, :]*N + n[:, 2:]
    left, pix, right = m[1:-1, :]*N + n[:, 0:-2],  m[1:-1, :]*N + n[:, 1:-1], m[1:-1, :]*N + n[:, 2:]
    bottom_left, bottom, bottom_right = m[2:, :]*N + n[:, 0:-2],  m[2:, :]*N + n[:, 1:-1], m[2:, :]*N + n[:, 2:]

    cdef np.ndarray[INTTYPE_t, ndim=3] mat_neigh_init
    mat_neigh_init = np.dstack([pix, top, bottom, left, right, top_left, bottom_right, bottom_left, top_right])

    cdef np.ndarray[INTTYPE_t, ndim=2] mat_neigh
    mat_neigh = mat_neigh_init.reshape((-1, 9))    

    cdef unsigned int i
    truc = list()
    cdef np.ndarray[INTTYPE_t, ndim=3] bidule
    cdef np.ndarray[INTTYPE_t, ndim=2] theta
    cdef np.ndarray[DOUBLETYPE_t, ndim=1] prior
    cdef unsigned int drawn_label, iteration       



    # loop over ICE iterations
    for iteration in np.arange(nb_iter):

        np.random.shuffle(ind) 

        # loop over pixels        
        for i in ind:            

            truc = map(functools.partial(func_for_map, img_flat=img_flat, mat_neigh=mat_neigh, i=i), labels)                        


            bidule = np.array(truc).T.reshape((-1, 2, labels.size)).astype(INTTYPE)            


            theta = np.sum(bidule, axis=1).T

            # ok so far

            prior = np.exp(-np.dot(theta, betas)).astype(DOUBLETYPE)
#            print('ok after prior') 
#            return 0
            # sample from the posterior
            drawn_label = np.random.choice(labels, p=prior/np.sum(prior))

            img_flat[(i//(N-2) + 1)*N + i%(N-2) + 1] = drawn_label


        if iteration >= burnin:
            print('Iteration %i --> sample' % iteration)
            lst_samples.append(copy.copy(img_flat.reshape(M, N)))

        else:
            print('Iteration %i --> burnin' % iteration)   



    return lst_samples

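However, I ended up with almost the same computation time, the numpy version being slightly faster than the Cython one.

I am therefore trying to improve the Cython code.

EDIT:

For both functions (Cython and non-Cython), I have replaced: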

truc = map(functools.partial(lambda label, img_flat, mat_neigh : 1-np.equal(label, img_flat[mat_neigh[i, 1:]]).astype(np.uint), img_flat=img_flat, mat_neigh=mat_neigh), labels)

with broadcasting:

truc = 1-np.equal(labels[:, None], img_flat[mat_neigh[i, 1:]][None, :])

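All np.arange calls have been replaced by range, and the prior is now computed with np.einsum, as suggested by Divakar.

Both functions are faster than before, but the Python one is still slightly faster than the Cython one.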
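
As a sanity check on the edit above, here is a minimal sketch showing that the broadcast comparison gives the same table as one comparison per label. The random data and the name neigh_idx (standing in for mat_neigh[i, 1:]) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
img_flat = rng.integers(0, 3, size=30)    # made-up flattened label image
labels = np.unique(img_flat)
neigh_idx = rng.integers(0, 30, size=8)   # stands in for mat_neigh[i, 1:]

# One comparison per label, which is what the map/partial call does.
per_label = np.array([
    1 - np.equal(label, img_flat[neigh_idx]).astype(np.int64)
    for label in labels
])

# A single broadcast comparison: (n_labels, 1) against (1, 8).
truc = 1 - np.equal(labels[:, None], img_flat[neigh_idx][None, :])

print(np.array_equal(per_label, truc))  # True
```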

Answer 1

I ran Cython on your source in annotation mode and looked at the results. That is, I saved it as q.pyx and ran

cython -a q.pyx
firefox q.html

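(using whichever browser you prefer, of course).

The code is colored deep yellow, which indicates that, as far as Cython is concerned, it is far from statically typed. AFAICT, the problems fall into two categories.

In some cases, you can statically type the code better: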

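In for iteration in np.arange(nb_iter): and for i in ind:, you pay about 30 lines of C for each iteration. For details, see how to access numpy arrays efficiently in Cython.
In truc = map(functools.partial(func_for_map, img_flat=img_flat, mat_neigh=mat_neigh, i=i), labels), you gain nothing from static typing. I suggest you cdef the function func_for_map and call it yourself in a loop.

In other cases, you are calling numpy vectorized functions, e.g. theta = np.sum(bidule, axis=1).T and prior = np.exp(-np.dot(theta, betas)).astype(DOUBLETYPE). In these cases, Cython really does not offer much benefit.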
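
The suggestion to call the comparison yourself in a loop can be sketched in plain Python as follows. This only illustrates the loop structure: the function name mismatch_table is hypothetical, and in a .pyx file the arguments and loop counters would additionally need static types before Cython could turn the loops into tight C code.

```python
import numpy as np


def mismatch_table(labels, img_flat, neigh_idx):
    """Explicit-loop version of mapping func_for_map over labels.

    neigh_idx plays the role of mat_neigh[i, 1:]; entry [a, b] is 1 when
    neighbour b does not carry label a, and 0 otherwise.
    """
    n_labels = labels.shape[0]
    n_neigh = neigh_idx.shape[0]
    out = np.empty((n_labels, n_neigh), dtype=np.int64)
    for a in range(n_labels):
        for b in range(n_neigh):
            out[a, b] = 0 if img_flat[neigh_idx[b]] == labels[a] else 1
    return out


# Tiny made-up example: 3 labels, 8 neighbours.
demo_img = np.array([0, 1, 2, 1, 0, 2, 2, 1, 0, 1])
demo_labels = np.unique(demo_img)
demo_neigh = np.array([1, 3, 0, 5, 9, 2, 4, 7])
print(mismatch_table(demo_labels, demo_img, demo_neigh))
```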

Answer 2

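If you are looking to speed up the NumPy code, we could improve the performance inside the innermost loop, and hopefully that translates into some overall improvement.

So, we have: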

theta = np.sum(bidule, axis=1).T
prior = np.exp(-np.dot(theta, betas))

Fusing the summation and the matrix multiplication into one step, we would have -

np.dot(np.sum(bidule, axis=1).T, betas)

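Now, this involves a summation along one axis, followed by a sum-reduction after elementwise multiplication. Among the many tools available, np.einsum helps here, especially because it lets us perform both reductions in one go, like so -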

np.einsum('ijk,i->k',bidule,betas)

Runtime test -

In [98]: # Setup
    ...: N = 100
    ...: bidule = np.random.rand(4,2,N)
    ...: betas = np.random.rand(4)
    ...: 

In [99]: %timeit np.dot(np.sum(bidule, axis=1).T, betas)
100000 loops, best of 3: 12.4 µs per loop

In [100]: %timeit np.einsum('ijk,i->k',bidule,betas)
100000 loops, best of 3: 4.05 µs per loop

In [101]: # Setup
     ...: N = 10000
     ...: bidule = np.random.rand(4,2,N)
     ...: betas = np.random.rand(4)
     ...: 

In [102]: %timeit np.dot(np.sum(bidule, axis=1).T, betas)
10000 loops, best of 3: 157 µs per loop

In [103]: %timeit np.einsum('ijk,i->k',bidule,betas)
10000 loops, best of 3: 90.9 µs per loop

So, hopefully the speedup would be noticeable when running many iterations.
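
Before swapping one expression for the other, it may be worth confirming that they agree. The sketch below does so on made-up data with the same (4, 2, N) layout as bidule; the sizes and the random seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
bidule = rng.integers(0, 2, size=(4, 2, 5))   # 0/1 indicators, as in the sampler
betas = rng.random(4)

# Two-step version: sum over the pair axis, then a dot product with betas.
two_step = np.dot(np.sum(bidule, axis=1).T, betas)

# One-step version: einsum sums over both i (direction) and j (pair member).
one_step = np.einsum('ijk,i->k', bidule, betas)

print(np.allclose(two_step, one_step))  # True
```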

Answer 3

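This answer has a good explanation of why Numpy can be inefficient and why you would still want to use Cython. Basically:

overhead on small arrays (this also covers reductions over small dimensions, as in np.sum(bidule, axis=1));
cache thrashing on large arrays, caused by intermediate temporaries.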

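In this case, to benefit from Cython you have to replace the Numpy array operations with plain Python loops: Cython must be able to translate them into C code, otherwise there is no point. This does not mean you have to rewrite every Numpy function; you have to be smart about it.

For example, you should get rid of the mat_neigh and bidule arrays and just loop over the indices and sum.

On the other hand, you should keep the (normalized) prior array and keep using np.random.choice. There is no simple way around that (well... see the source of choice). Unfortunately, this means that this part will probably become the performance bottleneck.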
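
One possible reading of "loop over the indices and sum" is sketched below in plain Python. The function name conditional_probs, the PAIRS table and the 2-D indexing are assumptions made for illustration; the pair order mirrors the columns of mat_neigh in the question. In Cython the same loops would carry static types, and the returned array is what would be passed as p to np.random.choice.

```python
import numpy as np

# Offsets (d_row, d_col) grouped in the order of mat_neigh[:, 1:]:
# (top, bottom), (left, right), (top_left, bottom_right), (bottom_left, top_right)
PAIRS = (
    ((-1, 0), (1, 0)),
    ((0, -1), (0, 1)),
    ((-1, -1), (1, 1)),
    ((1, -1), (-1, 1)),
)


def conditional_probs(img, r, c, labels, betas):
    """Normalized prior for interior pixel (r, c), without mat_neigh or bidule."""
    probs = np.empty(len(labels))
    for k in range(len(labels)):
        energy = 0.0
        for d in range(4):
            for dr, dc in PAIRS[d]:
                # Each neighbour that differs from the candidate label adds betas[d].
                if img[r + dr, c + dc] != labels[k]:
                    energy += betas[d]
        probs[k] = np.exp(-energy)
    return probs / probs.sum()


rng = np.random.default_rng(0)
img = rng.integers(0, 3, size=(5, 6))          # made-up label image
labels = np.unique(img)
betas = np.array([0.5, 1.0, -0.3, 0.2])        # made-up parameters
p = conditional_probs(img, 2, 3, labels, betas)
drawn_label = rng.choice(labels, p=p)
```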

Answer 1
3 2016-10-25 18:40:47

Answer 2
2 2016-10-25 18:12:38

Answer 3
1 accepted 2016-10-27 13:13:34
