
Cython prange slower for 4 threads than with range

I am currently trying to follow a simple example for parallelizing a loop with Cython's prange. I have installed OpenBLAS 0.2.14 with OpenMP enabled and compiled numpy 1.10.1 and scipy 0.16 from source against OpenBLAS. To test the performance of the libraries I am following this example: http://nealhughes.net/parallelcomp2/. The functions to be timed are copied from the site:

import numpy as np
from math import exp 
from libc.math cimport exp as c_exp
from cython.parallel import prange, parallel

def array_f(X):

    Y = np.zeros(X.shape)
    index = X > 0.5
    Y[index] = np.exp(X[index])

    return Y

def c_array_f(double[:] X):

    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i

    for i in range(N):
        if X[i] > 0.5:
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0

    return Y


def c_array_f_multi(double[:] X):

    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    with nogil, parallel():
        for i in prange(N):
            if X[i] > 0.5:
                Y[i] = c_exp(X[i])
            else:
                Y[i] = 0

    return Y

The author of the code reports the following speed-ups for 4 cores:

from thread_demo import *
import numpy as np
X = -1 + 2*np.random.rand(10000000) 
%timeit array_f(X)
1 loops, best of 3: 222 ms per loop
%timeit c_array_f(X)
10 loops, best of 3: 87.5 ms per loop 
%timeit c_array_f_multi(X)
10 loops, best of 3: 22.4 ms per loop

When I run these examples on my machine (a MacBook Pro with OS X 10.10), I get the following timings with export OMP_NUM_THREADS=1:

In [1]: from bla import *
In [2]: import numpy as np
In [3]: X = -1 + 2*np.random.rand(10000000)
In [4]: %timeit c_array_f(X)
10 loops, best of 3: 89.7 ms per loop
In [5]: %timeit c_array_f_multi(X)
1 loops, best of 3: 343 ms per loop

and with OMP_NUM_THREADS=4:

In [1]: from bla import *
In [2]: import numpy as np
In [3]: X = -1 + 2*np.random.rand(10000000)
In [4]: %timeit c_array_f(X)
10 loops, best of 3: 89.5 ms per loop
In [5]: %timeit c_array_f_multi(X)
10 loops, best of 3: 119 ms per loop

I see the same behavior on an openSuse machine, hence my question: how can the author get a 4x speed-up while the same code runs slower with 4 threads on two of my systems?

The setup script for generating the *.c and .so files is also identical to the one used in the blog post:

from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

ext_modules=[
    Extension("bla",
              ["bla.pyx"],
              libraries=["m"],
              extra_compile_args = ["-O3", "-ffast-math","-march=native", "-fopenmp" ],
              extra_link_args=['-fopenmp'],
              include_dirs = [np.get_include()]
              ) 
]

setup( 
  name = "bla",
  cmdclass = {"build_ext": build_ext},
  ext_modules = ext_modules
)
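The module is then compiled in the standard distutils way, which produces the bla.so that the timings above import:

python setup.py build_ext --inplace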

It would be great if someone could explain to me why this happens.

1) An important feature of prange (like any other parallel for loop) is that it activates out-of-order execution, which means the loop can execute in any arbitrary order. Out-of-order execution really pays off when there is no data dependency between iterations.

I do not know the internals of Cython, but I reckon that if boundscheck is not turned off, the loop cannot be executed arbitrarily, since the next iteration depends on whether or not the array goes out of bounds in the current iteration; hence the problem becomes almost serial, as threads have to wait for the result. This is one of the issues with your code. In fact, Cython gives me the following warning:

warning: bla.pyx:42:16: Use boundscheck(False) for faster access

So add the following:

from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
   # Rest of your code

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
   # Rest of your code
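For reference, here is what the multi-threaded function looks like with both decorators applied; this is just a sketch assembled from the pieces above (with the same imports as at the top of bla.pyx), nothing else changed:

import numpy as np
from cython import boundscheck, wraparound
from cython.parallel import prange, parallel
from libc.math cimport exp as c_exp

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):

    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    # With the checks disabled, each Y[i] access compiles to a raw
    # memory write, so iterations are truly independent.
    with nogil, parallel():
        for i in prange(N):
            if X[i] > 0.5:
                Y[i] = c_exp(X[i])
            else:
                Y[i] = 0

    return Y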

Let's now time them with your data, X = -1 + 2*np.random.rand(10000000).

With bounds checking:

In [2]:%timeit array_f(X)
10 loops, best of 3: 189 ms per loop
In [4]:%timeit c_array_f(X)
10 loops, best of 3: 93.6 ms per loop
In [5]:%timeit c_array_f_multi(X)
10 loops, best of 3: 103 ms per loop

Without bounds checking:

In [9]:%timeit c_array_f(X)
10 loops, best of 3: 84.2 ms per loop
In [10]:%timeit c_array_f_multi(X)
10 loops, best of 3: 42.3 ms per loop

These results are with num_threads=4 (I have 4 logical cores), and the speed-up is around 2x. Before going further we can still shave off a few more ms by declaring our arrays to be contiguous, i.e. declaring X and Y with double[::1], as sketched below.
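Only the two memoryview declarations change; the [::1] spec tells Cython the buffer is C-contiguous, so element access compiles down to plain pointer arithmetic (a sketch; the loop body is the same as above):

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[::1] X):      # was double[:] X
    cdef int N = X.shape[0]
    cdef double[::1] Y = np.zeros(N)     # was double[:] Y; np.zeros is C-contiguous
    cdef int i
    # ... same nogil/parallel prange loop as above ...
    return Y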

Contiguous arrays:

In [14]:%timeit c_array_f(X)
10 loops, best of 3: 81.8 ms per loop
In [15]:%timeit c_array_f_multi(X)
10 loops, best of 3: 39.3 ms per loop

2) Even more important is job scheduling, and this is what your benchmark suffers from. By default chunk sizes are determined at compile time, i.e. schedule=static; however, it is very likely that the environment variables (for instance OMP_SCHEDULE) and the workload of the two machines (yours and the one from the blog post) are different, and that they schedule the jobs at runtime: dynamically, with guided scheduling, and so on. Let's experiment by replacing your prange with:

for i in prange(N, schedule='static'):
    # static scheduling... 
for i in prange(N, schedule='dynamic'):
    # dynamic scheduling... 

Let's time them now (only the multi-threaded code):

Scheduling effect:

In [23]:%timeit c_array_f_multi(X) # static
10 loops, best of 3: 39.5 ms per loop
In [28]:%timeit c_array_f_multi(X) # dynamic
1 loops, best of 3: 319 ms per loop

You might be able to replicate this depending on the workload on your own machine. As a side note, since you are just trying to measure the performance of parallel vs. serial code in a micro-benchmark and not in actual code, I suggest you get rid of the if-else condition, i.e. keep only Y[i] = c_exp(X[i]) within the for loop, as sketched below. This is because if-else statements also adversely affect branch prediction and out-of-order execution in parallel code. On my machine I get almost a 2.7x speed-up over the serial code with this change.
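A minimal sketch of that branch-free benchmark loop (the function name c_array_f_multi_nobranch and the contiguous signature are illustrative choices, not from the original post):

@boundscheck(False)
@wraparound(False)
def c_array_f_multi_nobranch(double[::1] X):
    cdef int N = X.shape[0]
    cdef double[::1] Y = np.zeros(N)
    cdef int i
    with nogil, parallel():
        for i in prange(N):
            # Every iteration now does identical work, which keeps the
            # threads' chunks balanced and avoids branch mispredictions.
            Y[i] = c_exp(X[i])
    return Y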
