
Performance of xtensor types vs. NumPy for simple reduction

I was trying out xtensor-python and started by writing a very simple sum function, after using the cookiecutter setup and enabling SIMD intrinsics with xsimd.

inline double sum_pytensor(xt::pytensor<double, 1> &m)
{
  // reduce a 1-D tensor with static dimensionality
  return xt::sum(m)();
}
inline double sum_pyarray(xt::pyarray<double> &m)
{
  // reduce an array with dynamic dimensionality
  return xt::sum(m)();
}
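For completeness, these functions would be exposed to Python with a pybind11 module definition along these lines; this is a minimal sketch assuming the xtensor-python cookiecutter layout (the binding code is not shown in the original question, and the module name xtensor_basics is taken from the benchmark below):

#include "pybind11/pybind11.h"
#define FORCE_IMPORT_ARRAY  // let xtensor-python initialize the NumPy C API in this translation unit
#include "xtensor-python/pyarray.hpp"
#include "xtensor-python/pytensor.hpp"
#include "xtensor/xmath.hpp"

// sum_pytensor and sum_pyarray defined as above

PYBIND11_MODULE(xtensor_basics, m)
{
    xt::import_numpy();  // required before any pyarray/pytensor is used
    m.def("sum_pytensor", sum_pytensor);
    m.def("sum_pyarray", sum_pyarray);
}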

I used setup.py to build my Python module, then tested the summation functions on NumPy arrays of different sizes constructed with np.random.randn, comparing against np.sum.

import timeit

def time_each(func_names, size):
    # build each benchmark in a fresh namespace with an array of the given size
    setup = f'''
import numpy; import xtensor_basics
arr = numpy.random.randn({size})
    '''
    # best of 7 repeats, 100 calls per repeat (seconds)
    tim = lambda func: min(timeit.Timer(f'{func}(arr)',
                                        setup=setup).repeat(7, 100))
    return [tim(func) for func in func_names]

from functools import partial

sizes = [10 ** i for i in range(9)]
funcs = ['numpy.sum',
         'xtensor_basics.sum_pyarray',
         'xtensor_basics.sum_pytensor']
sum_timer = partial(time_each, funcs)
times = list(map(sum_timer, sizes))
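To collect the results into the table below, something like the following works; the use of pandas here is my assumption and not part of the original benchmark:

import pandas as pd

# one row per array size, one column per function; values in seconds per 100 calls
df = pd.DataFrame(times, columns=funcs, index=sizes)
print(df)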

This (possibly flawed) benchmark seemed to indicate that, for this basic function, xtensor's performance degraded relative to NumPy as the arrays got larger. Times below are in seconds per 100 calls (best of 7 repeats), indexed by array size:

size       numpy.sum  xtensor_basics.sum_pyarray  xtensor_basics.sum_pytensor
1           0.000268                    0.000039                     0.000039
10          0.000258                    0.000040                     0.000039
100         0.000247                    0.000048                     0.000049
1000        0.000288                    0.000167                     0.000164
10000       0.000568                    0.001353                     0.001341
100000      0.003087                    0.013033                     0.013038
1000000     0.045171                    0.132150                     0.132174
10000000    0.434112                    1.313274                     1.313434
100000000   4.180580                   13.129517                    13.129058

[figure: benchmark timings plotted against array size]

Any idea why I'm seeing this? I'm guessing NumPy uses something that xtensor doesn't (yet), but I wasn't sure what that could be for a reduction this simple. I dug through xmath.hpp but didn't see anything obvious, and nothing like this is mentioned in the documentation.


Versions

numpy                          1.13.3
openblas                       0.2.20
python                         3.6.3
xtensor                        0.12.1
xtensor-python                 0.14.0 

Wow, this is a coincidence! I am working on exactly this speedup!

xtensor's sum is a lazy operation, and it doesn't use the most performant iteration order for (auto-)vectorization. However, we just added an evaluation_strategy parameter to reductions (and the upcoming accumulations) which lets you select between immediate and lazy evaluation.

Immediate reductions perform the reduction right away (rather than lazily) and can use an iteration order optimized for vectorized reductions.
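As a rough sketch of the usage (the exact spelling of the tag is defined in the PR and may have changed since; recent xtensor versions accept xt::evaluation_strategy::immediate):

// default: lazy -- builds an expression, computed on access
double lazy_sum = xt::sum(m)();

// immediate: reduces right away with a vectorization-friendly iteration order
double eager_sum = xt::sum(m, xt::evaluation_strategy::immediate)();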

You can find this feature in this PR: https://github.com/QuantStack/xtensor/pull/550

In my benchmarks this should be at least as fast as NumPy, or faster. I hope to get it merged today.

By the way, please don't hesitate to drop by our Gitter channel and post a link to the question; we need to monitor StackOverflow better: https://gitter.im/QuantStack/Lobby
