I was trying out xtensor-python and started by writing a very simple sum function, after using the cookiecutter setup and enabling SIMD intrinsics with xsimd.
#include "xtensor/xmath.hpp"
#include "xtensor-python/pytensor.hpp"
#include "xtensor-python/pyarray.hpp"

inline double sum_pytensor(xt::pytensor<double, 1> &m)
{
    return xt::sum(m)();
}

inline double sum_pyarray(xt::pyarray<double> &m)
{
    return xt::sum(m)();
}
I used setup.py to build my Python module, then tested the summation function on NumPy arrays constructed from np.random.randn at different sizes, comparing against np.sum.
import timeit

def time_each(func_names, size):
    setup = f'''
import numpy; import xtensor_basics
arr = numpy.random.randn({size})
'''
    tim = lambda func: min(timeit.Timer(f'{func}(arr)',
                                        setup=setup).repeat(7, 100))
    return [tim(func) for func in func_names]

from functools import partial

sizes = [10 ** i for i in range(9)]
funcs = ['numpy.sum',
         'xtensor_basics.sum_pyarray',
         'xtensor_basics.sum_pytensor']
sum_timer = partial(time_each, funcs)
times = list(map(sum_timer, sizes))
This (possibly flawed) benchmark seemed to indicate that xtensor's performance on this basic function degrades relative to NumPy as the arrays get larger. The times below are in seconds, each the minimum of seven repeats of 100 calls:
     size  numpy.sum  xtensor_basics.sum_pyarray  xtensor_basics.sum_pytensor
        1   0.000268                    0.000039                     0.000039
       10   0.000258                    0.000040                     0.000039
      100   0.000247                    0.000048                     0.000049
     1000   0.000288                    0.000167                     0.000164
    10000   0.000568                    0.001353                     0.001341
   100000   0.003087                    0.013033                     0.013038
  1000000   0.045171                    0.132150                     0.132174
 10000000   0.434112                    1.313274                     1.313434
100000000   4.180580                   13.129517                    13.129058
Any idea why I'm seeing this? I'm guessing it's something NumPy utilizes that xtensor does not (yet), but I wasn't sure what it could be for a reduction as simple as this. I dug through xmath.hpp but didn't see anything obvious, and nothing like this is referenced in the documentation.
Versions
numpy 1.13.3
openblas 0.2.20
python 3.6.3
xtensor 0.12.1
xtensor-python 0.14.0
wow this is a coincidence! I am working on exactly this speedup!
xtensor's sum is a lazy operation -- and it doesn't use the most performant iteration order for (auto-)vectorization. However, we just added an evaluation_strategy parameter to reductions (and the upcoming accumulations) which lets you select between immediate and lazy evaluation.
An immediate reduction performs the reduction right away (rather than lazily) and can use an iteration order optimized for vectorized reductions.
You can find this feature in this PR: https://github.com/QuantStack/xtensor/pull/550
In my benchmarks this should be at least as fast as, or faster than, numpy. I hope to get it merged today.
Btw, please don't hesitate to drop by our Gitter channel and post a link to the question; we need to monitor StackOverflow better: https://gitter.im/QuantStack/Lobby