
Avoiding (or speeding up) large loop in Python?

I'm using SageMath to perform some mathematical calculations, and at one point I have a for loop that looks like this:

uni = {}
end = (l[idx]^(e[idx] - 1)) * (l[idx] + 1) # where end in my case is about 2013265922, 
                                           # but can also be much much larger too.
for count in range(0, end):
    i = randint(1, 303325737249669131)     # this executes very fast in Sage
    if i in uni:
        uni[i] += 1
    else:
        uni[i] = 1

So basically, I want to generate a very large number of random integers in the given range and, for each one, check whether it is already in the dictionary: if so, increment its count; if not, initialize it to 1. But the loop takes so long that it doesn't finish in a reasonable amount of time, not because the operations inside the loop are complicated, but because there is a huge number of iterations to perform. Is there any way to avoid (or speed up) this kind of loop in Python?

I profiled your code (use cProfile for this), and the vast majority of the time is spent inside the randint function, which is called once per iteration of the loop.
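
For reference, a minimal sketch of how such a loop can be profiled with cProfile outside of Sage; the run_loop wrapper name is only for illustration, and the standard library's random.randint stands in for Sage's randint.

import cProfile
from random import randint

def run_loop(end):
    # Same structure as the original loop, wrapped in a function so that
    # cProfile can attribute the time spent to each call.
    uni = {}
    for _ in range(end):
        i = randint(1, 303325737249669131)
        if i in uni:
            uni[i] += 1
        else:
            uni[i] = 1
    return uni

# Sorting by cumulative time shows that randint dominates the runtime.
cProfile.run("run_loop(1000000)", sort="cumulative")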

I recommend vectorizing the loop with NumPy's random number generation, and then using a single call to the Counter class to extract the frequency counts.

import numpy
import numpy.random
from collections import Counter

assert 303325737249669131 < 18446744073709551615  # fits in a uint64

# numpy's randint draws from [low, high), so use high = N + 1 to match
# Sage's inclusive randint(1, N); all `end` samples are drawn in one call.
numbers = numpy.random.randint(low=1, high=303325737249669132,
                               size=end, dtype=numpy.uint64)
frequency = Counter(numbers)
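
As a quick, hypothetical illustration of what the Counter gives you (not part of the original answer): values drawn more than once are exactly the collisions you were tracking in the dictionary.

collisions = {value: count for value, count in frequency.items() if count > 1}
print(len(collisions), "values were drawn more than once")
print(frequency.most_common(5))   # the five most frequently drawn values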

For a loop of 1,000,000 iterations (smaller than the one you suggest) I observed a reduction from 6 seconds to about 1 second. So even with this approach you cannot expect much more than an order-of-magnitude reduction in computation time.

You may think that keeping an array of all the values in memory is inefficient and may lead to memory exhaustion before the computation ends. However, because "end" is small compared with the range of the random integers, the rate at which you will record collisions is low, so the memory cost of a full array is not significantly larger than storing the dictionary. If this does become an issue, you may wish to perform the computation in batches, as sketched below. In that spirit you may also want to use the multiprocessing facilities to distribute computations across many CPUs or even many machines (but look out for network costs if you choose that).
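
A minimal sketch of the batched variant (the batch_size value and the count_in_batches helper are illustrative assumptions, not part of the original answer):

import numpy
from collections import Counter

def count_in_batches(end, batch_size=10000000):
    # Accumulate frequency counts without holding all `end` samples
    # in memory at once.
    frequency = Counter()
    remaining = end
    while remaining > 0:
        n = min(batch_size, remaining)
        batch = numpy.random.randint(low=1, high=303325737249669132,
                                     size=n, dtype=numpy.uint64)
        frequency.update(batch)     # Counter.update counts iterable elements
        remaining -= n
    return frequency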

The biggest speedup you can make without low-level magic is using defaultdict, i.e.

from collections import defaultdict

uni = defaultdict(int)                     # missing keys default to 0
for count in range(0, end):
    i = randint(1, 303325737249669131)     # this executes very fast in Sage
    uni[i] += 1                            # no if/else branch needed

If you're using Python 2, change range to xrange.

Beyond this, I'm pretty sure it's somewhere near the limit for pure Python. The loop is:

  • generating a random integer (already optimized as much as possible without static typing)
  • calculating its hash
  • updating the dict; with defaultdict the if-else branch is factored into more optimized code (see the timing sketch below)
  • from time to time, malloc calls to resize the dict - this is fast (considering the inability to preallocate memory for a dict)
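
For a rough sense of the difference, a small timing sketch using timeit (the helper names and the iteration count are illustrative, not from the original answer):

import timeit
from collections import defaultdict
from random import randint

def with_plain_dict(n):
    # Original pattern: explicit membership test and branch.
    uni = {}
    for _ in range(n):
        i = randint(1, 303325737249669131)
        if i in uni:
            uni[i] += 1
        else:
            uni[i] = 1
    return uni

def with_defaultdict(n):
    # defaultdict(int) supplies 0 for missing keys, so no branch is needed.
    uni = defaultdict(int)
    for _ in range(n):
        i = randint(1, 303325737249669131)
        uni[i] += 1
    return uni

print(timeit.timeit(lambda: with_plain_dict(100000), number=1))
print(timeit.timeit(lambda: with_defaultdict(100000), number=1))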
