Poor performance of C++ function in Cython
I have this C++ function, which I can call from Python with the code below. The performance is only half of what I get when running pure C++. Is there a way to bring the two to the same level? I compile both codes with the -Ofast -march=native flags. I do not understand where I can lose 50%, because most of the time should be spent in the C++ kernel. Is Cython making a memory copy that I can avoid?
namespace diff
{
    void diff_cpp(double* __restrict__ at, const double* __restrict__ a, const double visc,
                  const double dxidxi, const double dyidyi, const double dzidzi,
                  const int itot, const int jtot, const int ktot)
    {
        const int ii = 1;
        const int jj = itot;
        const int kk = itot*jtot;

        for (int k=1; k<ktot-1; k++)
            for (int j=1; j<jtot-1; j++)
                for (int i=1; i<itot-1; i++)
                {
                    const int ijk = i + j*jj + k*kk;
                    at[ijk] += visc * (
                            + ( (a[ijk+ii] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-ii]) ) * dxidxi
                            + ( (a[ijk+jj] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                            + ( (a[ijk+kk] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                            );
                }
    }
}
I have this .pyx file:
# import both numpy and the Cython declarations for numpy
import cython
import numpy as np
cimport numpy as np

# declare the interface to the C code
cdef extern from "diff_cpp.cpp" namespace "diff":
    void diff_cpp(double* at, double* a, double visc, double dxidxi, double dyidyi, double dzidzi, int itot, int jtot, int ktot)

@cython.boundscheck(False)
@cython.wraparound(False)
def diff(np.ndarray[double, ndim=3, mode="c"] at not None,
         np.ndarray[double, ndim=3, mode="c"] a not None,
         double visc, double dxidxi, double dyidyi, double dzidzi):
    cdef int ktot, jtot, itot
    ktot, jtot, itot = at.shape[0], at.shape[1], at.shape[2]
    diff_cpp(&at[0,0,0], &a[0,0,0], visc, dxidxi, dyidyi, dzidzi, itot, jtot, ktot)
    return None
I call this function in Python:
import numpy as np
import diff
import time

nloop = 20
itot = 256
jtot = 256
ktot = 256
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot))
index = np.arange(ncells)
a = (index/(index+1))**2
a.shape = (ktot, jtot, itot)

# Check results
diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
print("at={0}".format(at.flatten()[itot*jtot+itot+itot//2]))

# Time the loop
start = time.perf_counter()
for i in range(nloop):
    diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
end = time.perf_counter()
print("Time/iter: {0} s ({1} iters)".format((end-start)/nloop, nloop))
This is the setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("diff",
                             sources=["diff.pyx"],
                             language="c++",
                             # one list element per compiler flag
                             extra_compile_args=["-Ofast", "-march=native"],
                             include_dirs=[numpy.get_include()])],
)
And here is the C++ reference file that reaches twice the performance:
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <stdlib.h>
#include <cstdio>
#include <ctime>
#include "math.h"

void init(double* const __restrict__ a, double* const __restrict__ at, const int ncells)
{
    for (int i=0; i<ncells; ++i)
    {
        a[i] = pow(i,2)/pow(i+1,2);
        at[i] = 0.;
    }
}

void diff(double* const __restrict__ at, const double* const __restrict__ a, const double visc,
          const double dxidxi, const double dyidyi, const double dzidzi,
          const int itot, const int jtot, const int ktot)
{
    const int ii = 1;
    const int jj = itot;
    const int kk = itot*jtot;

    for (int k=1; k<ktot-1; k++)
        for (int j=1; j<jtot-1; j++)
            for (int i=1; i<itot-1; i++)
            {
                const int ijk = i + j*jj + k*kk;
                at[ijk] += visc * (
                        + ( (a[ijk+ii] - a[ijk   ])
                          - (a[ijk   ] - a[ijk-ii]) ) * dxidxi
                        + ( (a[ijk+jj] - a[ijk   ])
                          - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                        + ( (a[ijk+kk] - a[ijk   ])
                          - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                        );
            }
}

int main()
{
    const int nloop = 20;
    const int itot = 256;
    const int jtot = 256;
    const int ktot = 256;
    const int ncells = itot*jtot*ktot;

    double *a  = new double[ncells];
    double *at = new double[ncells];

    init(a, at, ncells);

    // Check results
    diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot);
    printf("at=%.20f\n",at[itot*jtot+itot+itot/2]);

    // Time performance
    std::clock_t start = std::clock();

    for (int i=0; i<nloop; ++i)
        diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot);

    double duration = (std::clock() - start ) / (double)CLOCKS_PER_SEC;

    printf("time/iter = %f s (%i iters)\n",duration/(double)nloop, nloop);

    return 0;
}
The problem here is not what happens at run time, but which optimizations happen at compile time.
Which optimizations are performed depends on the compiler (and even its version), and there is no guarantee that every optimization that can be done will be done.
Actually there are two different reasons why Cython is slower, depending on whether you use g++ or clang++:
- g++: the flag -fwrapv in the Cython build prevents some optimizations.
- clang++: inlining in the test C++ program gives the pure C++ version an advantage.
First issue (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations can't be done.
If you look at the log of the setup, you will see:
x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native
As you said, -Ofast will win against -O2 because it comes last. But the problem is -fwrapv, which seems to prevent some optimizations, because signed integer overflow can no longer be treated as undefined behavior (UB) and exploited by the optimizer.
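To make the wrapping rule concrete, here is a small Python sketch that emulates what -fwrapv promises for a 32-bit int. It is only an illustration and not part of the build; the helper names are made up, and it assumes int is 32 bits wide, as on typical x86-64 systems. With -fwrapv the compiler must allow for the case where ijk + 1 is smaller than ijk, whereas without the flag it may assume that signed index arithmetic never overflows.

```python
import ctypes

INT_MAX = 2**31 - 1  # assumption: int is a 32-bit two's-complement type


def add_wrapping(x, y):
    """Return x + y reduced to a 32-bit signed int, i.e. with wraparound."""
    # ctypes integer types do no overflow checking, so the value wraps.
    return ctypes.c_int32(x + y).value


def successor_is_larger(ijk):
    """Check whether ijk + 1 > ijk still holds under wrapping arithmetic."""
    return add_wrapping(ijk, 1) > ijk


print(successor_is_larger(1000))     # True
print(successor_is_larger(INT_MAX))  # False: INT_MAX + 1 wraps to INT_MIN
```

In the kernel the indices stay far below INT_MAX for a 256x256x256 grid, but the compiler cannot know that when the sizes are only available at run time.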
One option is to add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which might be unwanted.
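A minimal sketch of what adding the flag could look like, based on the setup.py from the question. The helper names extension_kwargs and build are hypothetical, and the sketch assumes setuptools, Cython and numpy are installed. Each flag is its own list element, and -fno-wrapv is listed so that it ends up after Cython's default -fwrapv on the compiler command line, just as -Ofast ends up after -O2 in the log above.

```python
def extension_kwargs():
    """Arguments for Extension(...), following the question's setup.py."""
    return dict(
        name="diff",
        sources=["diff.pyx"],
        language="c++",
        # extra arguments are appended after the default flags,
        # so -fno-wrapv comes after the default -fwrapv
        extra_compile_args=["-Ofast", "-march=native", "-fno-wrapv"],
    )


def build():
    """Build the extension; the imports are local so the sketch loads without them."""
    from setuptools import setup, Extension
    from Cython.Build import cythonize
    import numpy

    ext = Extension(include_dirs=[numpy.get_include()], **extension_kwargs())
    setup(ext_modules=cythonize([ext]))
```

To use it, call build() at the bottom of setup.py and run python setup.py build_ext --inplace, then check in the build log that -fno-wrapv really appears after -fwrapv.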
Second issue (clang++): inlining in the test cpp-program.
When I compile your cpp-program with my pretty old g++ version 5.4:
g++ test.cpp -o test -Ofast -march=native -fwrapv
it becomes almost 3 times slower than when compiled without -fwrapv. This is, however, a weakness of the optimizer: when inlining, it should see that no signed integer overflow is possible (all dimensions are about 256), so the flag -fwrapv shouldn't have any impact.
My old clang++ version (3.8) seems to do a better job here: with the flags above I cannot see any degradation of performance. I need to disable inlining via -fno-inline to get slower code, but then it is slower even without -fwrapv, i.e.:
clang++ test.cpp -o test -Ofast -march=native -fno-inline
So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can optimize the code for the known values, which is something Cython cannot do.
So we can see: clang++ was not able to optimize the function diff for arbitrary sizes but was able to optimize it for size=256. Cython, however, can only use the unoptimized version of diff. That is why -fno-wrapv has no positive impact in this case.
My take-away: disallow inlining of the function of interest in the cpp tester (e.g. compile it in its own object file) to ensure a level playing field with Cython. Otherwise one sees the performance of a program that was specially optimized for this one input.
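To repeat the comparisons above on your own machine, a small driver script can build and run the C++ tester with each flag combination. This is only a sketch: build_command, VARIANTS and build_and_run are hypothetical names, the source file is assumed to be test.cpp as in the commands above, and the chosen compiler must be on the PATH.

```python
import subprocess

# Flag sets taken from the commands above.
VARIANTS = {
    "baseline": [],
    "wrapv": ["-fwrapv"],
    "no-inline": ["-fno-inline"],
}


def build_command(compiler, extra_flags=(), source="test.cpp", output="test"):
    """Return the compiler invocation as an argument list."""
    return [compiler, source, "-o", output, "-Ofast", "-march=native", *extra_flags]


def build_and_run(compiler):
    """Compile test.cpp once per variant and print the program's timing output."""
    for name, flags in VARIANTS.items():
        exe = f"test_{name}"
        subprocess.run(build_command(compiler, flags, output=exe), check=True)
        result = subprocess.run([f"./{exe}"], check=True,
                                capture_output=True, text=True)
        print(f"{compiler} [{name}]")
        print(result.stdout, end="")
```

Calling build_and_run("g++") or build_and_run("clang++") then prints the time per iteration for each variant, which can be compared with the timing reported by the Python script.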
NB: A funny thing is that if all ints are replaced by unsigned ints, then naturally -fwrapv doesn't play any role, but the version with unsigned int is as slow as the int version with -fwrapv. This is only logical, as there is no undefined behavior to be exploited.