Performance of xtensor types vs. NumPy for simple reduction
I was trying out xtensor-python and started by writing a very simple sum function, after using the cookiecutter setup and enabling SIMD intrinsics with xsimd.
#include "xtensor/xmath.hpp"
#include "xtensor-python/pytensor.hpp"
#include "xtensor-python/pyarray.hpp"

inline double sum_pytensor(xt::pytensor<double, 1>& m)
{
    return xt::sum(m)();
}

inline double sum_pyarray(xt::pyarray<double>& m)
{
    return xt::sum(m)();
}
I used setup.py to build my Python module, then tested the summation function on NumPy arrays constructed from np.random.randn at different sizes, comparing against np.sum.
import timeit
from functools import partial

def time_each(func_names, size):
    setup = f'''
import numpy; import xtensor_basics
arr = numpy.random.randn({size})
'''
    tim = lambda func: min(timeit.Timer(f'{func}(arr)',
                                        setup=setup).repeat(7, 100))
    return [tim(func) for func in func_names]

sizes = [10 ** i for i in range(9)]
funcs = ['numpy.sum',
         'xtensor_basics.sum_pyarray',
         'xtensor_basics.sum_pytensor']
sum_timer = partial(time_each, funcs)
times = list(map(sum_timer, sizes))
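The post doesn't show how the results table below was produced; a minimal, hypothetical helper (not from the original post, column widths chosen arbitrarily) could render `times` like this using only the standard library:

```python
# Hypothetical helper (not from the original post) to render `times`
# as the results table below, using only the standard library.
def format_results(sizes, funcs, times):
    header = f"{'size':>11} " + " ".join(f"{f:>27}" for f in funcs)
    rows = [header]
    for size, row in zip(sizes, times):
        rows.append(f"{size:>11} " + " ".join(f"{t:>27.6f}" for t in row))
    return "\n".join(rows)
```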
This (possibly flawed) benchmark seemed to indicate that performance of xtensor for this basic function degraded for larger arrays as compared to NumPy.
size        numpy.sum  xtensor_basics.sum_pyarray  xtensor_basics.sum_pytensor
1            0.000268                    0.000039                     0.000039
10           0.000258                    0.000040                     0.000039
100          0.000247                    0.000048                     0.000049
1000         0.000288                    0.000167                     0.000164
10000        0.000568                    0.001353                     0.001341
100000       0.003087                    0.013033                     0.013038
1000000      0.045171                    0.132150                     0.132174
10000000     0.434112                    1.313274                     1.313434
100000000    4.180580                   13.129517                    13.129058
Any idea why I'm seeing this? I'm guessing it's something NumPy utilizes that xtensor does not (yet), but I wasn't sure what it could be for a reduction as simple as this. I dug through xmath.hpp but didn't see anything obvious, and nothing like this is referenced in the documentation.
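For context, one thing NumPy's np.sum does use that a naive accumulation loop does not is pairwise summation: the C implementation sums in blocks with an unrolled inner loop (which auto-vectorizes well) and combines the partial sums in a tree, which also improves floating-point accuracy. A rough Python sketch of the scheme (the block size of 8 here stands in for NumPy's actual block size of 128):

```python
# Sketch of pairwise summation, the scheme NumPy's np.sum uses internally.
# NumPy's C implementation uses a block size of 128 with an unrolled,
# SIMD-friendly inner loop; 8 is used here just to keep the sketch short.
def pairwise_sum(a, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n <= 8:  # small base case stands in for NumPy's unrolled block
        total = 0.0
        for i in range(lo, hi):
            total += a[i]
        return total
    mid = lo + n // 2
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid, hi)
```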
Versions
numpy 1.13.3
openblas 0.2.20
python 3.6.3
xtensor 0.12.1
xtensor-python 0.14.0
Wow, this is a coincidence! I am working on exactly this speedup!
xtensor's sum is a lazy operation -- and it doesn't use the most performant iteration order for (auto-)vectorization. However, we just added an evaluation_strategy parameter to reductions (and the upcoming accumulations) which allows you to select between immediate and lazy reductions.
Immediate reductions perform the reduction immediately (rather than lazily) and can use an iteration order optimized for vectorized reductions.
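As a loose analogy (this is plain Python, not xtensor's actual API), the difference between a lazy and an immediate reduction over an expression like `a + b` looks like this:

```python
# Loose Python analogy (not xtensor's API) of lazy vs. immediate reduction
# over an expression such as `a + b`.
def lazy_sum(a, b):
    # The expression is never materialized; each element is computed on
    # demand, in whatever order the expression iterator dictates.
    return sum(x + y for x, y in zip(a, b))

def immediate_sum(a, b):
    # The expression is evaluated into a concrete buffer first, so the
    # reduction can run a tight loop in a contiguous, SIMD-friendly order.
    buf = [x + y for x, y in zip(a, b)]
    return sum(buf)
```

Both return the same value; in C++ the immediate variant is what lets the compiler pick a vectorization-friendly iteration order.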
You can find this feature in this PR: https://github.com/QuantStack/xtensor/pull/550
In my benchmarks this should be at least as fast as, or faster than, NumPy. I hope to get it merged today.
Btw., please don't hesitate to drop by our Gitter channel and post a link to the question; we need to monitor StackOverflow better: https://gitter.im/QuantStack/Lobby