
Why is numpy faster than my c/c++ code for summing an array of float?

I was testing the efficiency of my simple shared C library and comparing it with the numpy implementation.

Library creation: the following function is defined in sum_function.c:

float sum_vector(float* data, int num_row){
    float value = 0.0;
    for (int i = 0; i < num_row; i++){
        value += data[i];
    }
    return value;
}

Library compilation: the shared library sum.so is created by

clang -c sum_function.c
clang -shared -o sum.so sum_function.o

Measurement: a simple numpy array is created and the sum of its elements is calculated using the above function.

from ctypes import *
import numpy as np

N = int(1e7)
data = np.arange(N, dtype=np.float32)

libc = cdll.LoadLibrary("sum.so")
libc.sum_vector.restype = c_float
libc.sum_vector(data.ctypes.data_as(POINTER(c_float)),
                c_int(N))

The above function takes 30 ms. However, if I use numpy.sum, the execution time is only 4 ms.

So my question is: what makes numpy so much faster than my C implementation? I cannot think of any algorithmic improvement for calculating the sum of a vector.

There are many reasons that could be involved, depending even on the compiler you are using. Your numpy backend is in many cases C/C++. In other words, you have to appreciate that languages like C++ allow for a lot more efficiency and closer contact with the hardware, but also demand a lot of knowledge. C++ less so than C, as long as you use the STL, as in @PaulMcKenzie's comment. Those routines are optimized for runtime performance.

The next thing is memory allocation. Your vector seems large enough that the allocator inside std::vector will align the memory on the heap. Memory on the stack can end up unaligned, which can make even std::accumulate slow. Here is an idea of how such an allocator could be written to avoid that: https://github.com/kvahed/codeare/blob/master/src/matrix/Allocator.hpp . This is part of an MRI image reconstruction library I wrote as a PhD student.

A word on SIMD: same library, other aspect: https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp . How to do state-of-the-art arithmetic is anything but trivial.

Both of the above concepts culminate in https://github.com/kvahed/codeare/blob/master/src/matrix/Matrix.hpp , where you can easily outperform any standardized code on a specific machine.

And last but not least: the compiler and the compiler flags. Once debugged, your runtime code should probably be compiled with -O2 -g or even -O3. If you have good test coverage, you might even get away with -Ofast, which ditches IEEE math precision. Apart from numerical integration, I have never witnessed issues.
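Concretely, the clang invocation in the question compiles at the default -O0, with no vectorization at all. A rebuild of the same library with optimization flags (standard clang/gcc flags) would look like this:

```shell
# Rebuild the shared library from the question with optimizations on.
# -O3 enables autovectorization; -ffast-math additionally allows the
# compiler to reorder the float additions (rounding may change slightly).
clang -O3 -ffast-math -c sum_function.c
clang -shared -o sum.so sum_function.o
```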

You need to enable optimizations

In addition to that, you have to check whether the compiler is able to use autovectorization. If you want to distribute a compiled binary, you may want to add multiple code paths (AVX2, SSE2) to get a runnable and performant version on all platforms.

A small overview of different implementations and their performance follows. If you can't beat the numpy sum implementation (the binary version installed via pip) on a recent processor, you have done something wrong; but also keep in mind that precision varies with the implementation and the compiler options (fastmath). I was too lazy to install clang, so I used Numba instead, which also has an LLVM backend (the same one clang uses).

import numba as nb
import numpy as np
import time

#prints information about SIMD vectorization
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')


@nb.njit(fastmath=True)  # roughly equivalent to -O3 -march=native -ffast-math
def sum_nb(ar):
  s1 = 0.  # float64 accumulator

  for i in range(ar.shape[0]):
    s1 += ar[i]

  return s1

N = int(1e7)
ar = np.random.rand(N).astype(np.float32)

#Numba solution float32 with float64 accumulator
#don't measure compilation time
sum_1=sum_nb(ar)
t1=time.time()
for i in range(1000):
  sum_1=sum_nb(ar)

print(time.time()-t1)

#Numba solution float64 with float64 accumulator
#don't measure compilation time
arr_64=ar.astype(np.float64)
sum_2=sum_nb(arr_64)
t1=time.time()
for i in range(1000):
  sum_2=sum_nb(arr_64)

print(time.time()-t1)

#Numpy solution (float32)
t1=time.time()
for i in range(1000):
  sum_3=np.sum(ar)

print(time.time()-t1)

#Numpy solution (float32, with float64 accumulator)
t1=time.time()
for i in range(1000):
  sum_4=np.sum(ar,dtype=np.float64)

print(time.time()-t1)

#Numpy solution (float64)
t1=time.time()
for i in range(1000):
  sum_5=np.sum(arr_64)

print(time.time()-t1)


print(sum_1)
print(sum_2)
print(sum_3)
print(sum_4)
print(sum_5)

Performance

#Numba solution float32 with float64 accumulator: 2.29 ms
#Numba solution float64 with float64 accumulator: 4.76 ms
#Numpy solution (float32): 5.72 ms
#Numpy solution (float32, with float64 accumulator): 7.97 ms
#Numpy solution (float64): 10.61 ms
