
vector math — numpy vs iterators

The code below shows that iterators are MUCH faster than numpy arrays (unless I'm doing something wrong).

import numpy as np
import itertools
import time

dim = 10000
arrays = [np.array((1, 2, 3)) for x in range(dim)]
iterators = [iter((1, 2, 3)) for x in range(dim)]

t_array = time.time()
print(sum(arrays))
print(time.time() - t_array)

# [10000 20000 30000]
# 0.016389131546020508


t_iterators = time.time()
print(list(sum(x) for x in zip(*iterators)))
print(time.time() - t_iterators)

# [10000, 20000, 30000]
# 0.0011029243469238281

And the iterator version works not only with iterators, but also with np.arrays, lists, or tuples.
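To illustrate that claim, here is a small sketch (the variable names are mine, not from the question) showing the zip/sum pattern accepting a mix of iterable types of equal length:

```python
import numpy as np

# The zip/sum pattern works on any mix of equal-length iterables:
seqs = [np.array((1, 2, 3)), [1, 2, 3], (1, 2, 3), iter((1, 2, 3))]

# Each column is summed element-wise; int() normalizes the numpy scalars.
result = [int(sum(col)) for col in zip(*seqs)]
print(result)  # [4, 8, 12]
```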

So, this is the place for objective questions, and I'm guessing there's an objective reason numpy is so often used for this kind of thing (based on what I've seen on the Internet).

What is that reason? Or am I doing something objectively wrong?

The problem is that this:

arrays = [np.array((1, 2, 3)) for x in range(dim)]

isn't an array, and this:

sum(arrays)

isn't a numpy operation.
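To make this concrete: the built-in `sum` is a Python-level loop that calls `ndarray.__add__` once per list element, allocating a small temporary array each time. It is roughly equivalent to this sketch (my own illustration, not code from the question):

```python
import numpy as np

dim = 10000
arrays = [np.array((1, 2, 3)) for _ in range(dim)]

# What sum(arrays) effectively does: one small-array addition per element,
# with Python-level loop overhead and a fresh temporary on every step.
total = np.zeros(3, dtype=int)
for a in arrays:
    total = total + a

print(total)  # [10000 20000 30000]
```

None of this loop runs inside NumPy's compiled code, which is why it cannot compete with a single vectorized reduction over a 2D array.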

Compare the timing with a list of arrays and the built-in sum:

>>> timeit.timeit('sum(arrays)', 'from __main__ import arrays', number=1000)
16.348400657162813

to what you get with a 2D array and numpy.sum:

>>> actual_array = numpy.array(arrays)
>>> timeit.timeit('numpy.sum(actual_array, axis=0)', 'from __main__ import actual_array; import numpy', number=1000)
0.20679712685881668

80x improvement. It beats the iterator version by a factor of 5. If you're going to use NumPy, you need to keep as much of the work in NumPy as possible.

I would say you are doing it wrong, but that is a matter of interpretation and depends on the details of the problem you are solving.

For the case presented, you are storing a two-dimensional array as a list of numpy arrays and then using "list processing" routines. This circumvents some of the benefits/optimizations possible within numpy.

A slightly modified version of your case, run in ipython (without running %pylab), is given below. Note that your example does not actually use itertools, only the built-in iter() function.

import numpy as np

dim = 10000
arrays = [np.array((1, 2, 3)) for x in range(dim)]
iterators = [iter((1, 2, 3)) for x in range(dim)]

%timeit sum(arrays)
10 loops, best of 3: 20.8 ms per loop

%timeit list(sum(x) for x in zip(*iterators))
1000 loops, best of 3: 468 µs per loop

[Edited based on comment below.]

So the iterators look great, but they have a limitation: they can only be used once. After we iterate over them, they are "empty". So the correct test using %timeit would be to recreate the iterators every time.
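A minimal demonstration of that one-shot behavior:

```python
it = iter((1, 2, 3))

first_pass = list(it)
second_pass = list(it)

print(first_pass)   # [1, 2, 3]
print(second_pass)  # [] -- the iterator is exhausted after one pass
```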

def iter_test():
    iterators = [iter((1, 2, 3)) for x in range(dim)]
    return list(sum(x) for x in zip(*iterators))

%timeit iter_test()
100 loops, best of 3: 4.06 ms per loop

Now we see it is (only) about 5 times faster than looping over arrays.

In pure numpy I would instead have done the following (the two-dimensional array can be created in many ways):

nparrays = np.asarray(arrays)
%timeit np.sum(nparrays, axis=0)
1000 loops, best of 3: 279 µs per loop

So this is much faster, as it should be.
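For completeness, the three approaches can be put side by side in one runnable script using the `timeit` module instead of the %timeit magic (absolute timings will vary by machine and NumPy version, so no numbers are hard-coded here; this is a sketch):

```python
import timeit
import numpy as np

dim = 10000
arrays = [np.array((1, 2, 3)) for _ in range(dim)]
nparrays = np.asarray(arrays)  # shape (10000, 3)

def builtin_sum():
    # Python-level loop over a list of small arrays
    return sum(arrays)

def iter_sum():
    # Iterators must be recreated each run -- they are one-shot
    iterators = [iter((1, 2, 3)) for _ in range(dim)]
    return list(sum(x) for x in zip(*iterators))

def numpy_sum():
    # Single vectorized reduction over the 2D array
    return np.sum(nparrays, axis=0)

for f in (builtin_sum, iter_sum, numpy_sum):
    t = timeit.timeit(f, number=100)
    print(f"{f.__name__:12s} {t:.4f} s / 100 runs")
```

All three produce the same result; only the vectorized version keeps the whole reduction inside NumPy's compiled code.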
