
Avoid nested for loops using List Comprehension and/or map

For a couple of days I've been struggling with how to optimize (not just make nicer-looking) three nested loops containing a conditional and a function call. What I have right now is the following:

def build_prolongation_operator(p,qs):
    '''
    p: dimension of the coarse basis
    qs: number of fine basis vectors for each coarse basis vector
        (so q = sum(qs) is the dimension of the fine basis)

    The prolongation operator describes the relationship between
    the coarse and fine bases:    
    V_coarse = np.dot(V_fine, I)
    '''

    q = sum(qs)

    I = np.zeros([q, p])

    for i in range(0, q):
        for j in range(0, p):
            for k in range(0, qs[j]):
                # if BV i is a child of j, we set I[i, j] = 1
                if i == f_map(j, k, qs):
                    I[i, j] = 1
                    break

    return I

where f_map is:

def f_map(i, j, q):
    '''
    Mapping which returns the index k of the fine basis vector which
    corresponds to the jth child of the ith coarse basis vector.    
    '''

    if j < 0 or j > q[i]:
        print('ERROR in f_map')
        return None

    result = j

    for k in range(0, i):
        result += q[k]

    return result

When profiling my whole code I get that build_prolongation_operator is called 45 times and f_map approximately 8.5 million times!!

Here is the picture:

[profiler screenshot]

I have tried to do the same with list comprehension and a map, but without any luck.

Here is a sample of the inputs that build_prolongation_operator expects:

p = 10
qs = np.random.randint(3, size=p)

I don't know about bases and prolongation operators, but you should focus on the algorithm itself. That is almost always sound advice where optimisation is concerned.

Here's probably the crux -- and if not, it's something to get you started: The f_map computation does not depend on i , but you are recomputing it for each value of i . Since i ranges from zero to the sum of the values in qs , you'll save a ginormous amount of recomputation by caching the results; google "python memoize" and it'll practically write itself. Fix this and you are probably done, without any micro-optimizations.
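A minimal sketch of the memoization idea using `functools.lru_cache` — note that the cached arguments must be hashable, so `qs` would have to be passed as a tuple, and the name `f_map_cached` is mine:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def f_map_cached(j, k, qs):
    # qs must be hashable here, so the caller passes tuple(qs)
    result = k
    for i in range(j):
        result += qs[i]
    return result
```

Inside `build_prolongation_operator` you would then call `f_map_cached(j, k, tuple(qs))`; repeated `(j, k)` pairs are looked up in the cache instead of recomputed.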

You'll need enough space to store roughly p * max(qs) values, but from the number of calls you report, this should not be much of an obstacle.

Try this and check if it works:

for j in range(0, p):
    for k in range(0, qs[j]):
        # if BV i is a child of j, we set I[i, j] = 1
        val = f_map(j, k, qs)
        if I[val, j] == 0:
            I[val, j] = 1

For one thing, you really don't need p as a parameter to your function: len(qs) only needs to be called once, and it's extremely cheap. If your input is always a numpy array (and under the circumstances there is no reason it shouldn't be), qs.size will do as well.

Let's start by rewriting f_map . The loop there is just the cumulative sum of qs (but starting with zero), which you can pre-compute once (or at least once per call to the outer function).

def f_map(i, j, cumsum_q):
    return j + cumsum_q[i]

Where cumsum_q would be defined in build_prolongation_operator as

cumsum_q = np.roll(np.cumsum(qs), 1)
cumsum_q[0] = 0
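For example, with a small `qs` the rolled cumulative sum works out like this:

```python
import numpy as np

qs = np.array([3, 1, 2])
cumsum_q = np.roll(np.cumsum(qs), 1)  # [6, 3, 4]: the total lands in slot 0
cumsum_q[0] = 0                       # [0, 3, 4]: exclusive prefix sums
# cumsum_q[j] is the index of the first fine basis vector
# that belongs to coarse basis vector j
```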

I am sure you can appreciate the usefulness of having the same set of variable names inside f_map as you have in build_prolongation_operator . To make it even easier, we can remove f_map entirely and inline the expression it represents into your condition:

if i == k + cumsum_q[j]:
    I[i, j] = 1

The loop over k then means "if i is k + cumsum[j] for any k ", set the element to 1. If we rewrite the condition as i - cumsum_q[j] == k , you can see that we don't need a loop over k at all. i - cumsum_q[j] will be equal to some k in the range [0, qs[j]) if it is non-negative and strictly less than qs[j] . You can just check

if i >= cumsum_q[j] and i - cumsum_q[j] < qs[j]:
    I[i, j] = 1

This reduces your loop to one iteration per element of the matrix. You can't do better than that:

def build_prolongation_operator_optimized(qs):
    '''
    The prolongation operator describes the relationship between
    the coarse and fine bases:    
    V_coarse = np.dot(V_fine, I)
    '''
    qs = np.asanyarray(qs)
    p = qs.size
    cumsum_q = np.roll(np.cumsum(qs), 1)
    q = cumsum_q[0]  # after the roll, slot 0 holds the total sum(qs)
    cumsum_q[0] = 0

    I = np.zeros([q, p])

    for i in range(0, q):
        for j in range(0, p):
            # if BV i is a child of j, we set I[i, j] = 1
            if 0 <= i - cumsum_q[j] < qs[j]:
                I[i, j] = 1
    return I

Now that you know the formula for each cell, you can have numpy compute the whole matrix for you in essentially one line using broadcasting:

def build_prolongation_operator_numpy(qs):
    qs = np.asanyarray(qs)
    cumsum_q = np.roll(np.cumsum(qs), 1)
    q = cumsum_q[0]
    cumsum_q[0] = 0
    i_ = np.arange(q).reshape(-1, 1)  # Make this a column vector
    return (i_ >= cumsum_q) & (i_ - cumsum_q < qs)
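To see the broadcasting at work, here is the same computation unrolled for a tiny input, with the array shapes noted in comments:

```python
import numpy as np

qs = np.array([2, 1])
cumsum_q = np.roll(np.cumsum(qs), 1)
q = cumsum_q[0]                              # total number of fine vectors: 3
cumsum_q[0] = 0                              # [0, 2]
i_ = np.arange(q).reshape(-1, 1)             # column vector, shape (3, 1)
I = (i_ >= cumsum_q) & (i_ - cumsum_q < qs)  # (3, 1) vs (2,) broadcasts to (3, 2)
# I == [[ True, False],
#       [ True, False],
#       [False,  True]]
```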

I ran a small demo to ensure that (A) the proposed solutions get the same result as your original, and (B) run faster:

In [1]: p = 10
In [2]: qs = np.random.randint(3, size=p)

In [3]: ops = (
...     build_prolongation_operator(p, qs),
...     build_prolongation_operator_optimized(qs),
...     build_prolongation_operator_numpy(qs),
...     build_prolongation_operator_RaunaqJain(p, qs),
...     build_prolongation_operator_gboffi(p, qs),
... )

In [4]: np.array([[(op1 == op2).all() for op1 in ops] for op2 in ops])
Out[4]: 
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

In [5]: %timeit build_prolongation_operator(p, qs)
321 µs ± 890 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit build_prolongation_operator_optimized(qs)
75.1 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit build_prolongation_operator_numpy(qs)
24.8 µs ± 77.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [8]: %timeit build_prolongation_operator_RaunaqJain(p, qs)
28.5 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [9]: %timeit build_prolongation_operator_gboffi(p, qs)
31.8 µs ± 772 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit build_prolongation_operator_gboffi2(p, qs)
26.6 µs ± 768 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As you can see, the fastest option is the fully vectorized one, but @RaunaqJain's and @gboffi's options come a very close second.

Note

My vectorized solution creates a boolean array. If you don't want that either use I.astype(...) to convert to the desired dtype, or view it as a np.uint8 array: I.view(dtype=np.uint8) .
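For instance, the two conversions behave like this (`astype` copies into a new array, `view` reinterprets the existing buffer):

```python
import numpy as np

I = np.array([[True, False], [False, True]])
I_float = I.astype(np.float64)  # new array with 1.0 / 0.0 entries
I_bytes = I.view(np.uint8)      # same buffer, read as 0 / 1 bytes
```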

Here is the optimized loop as proposed by Raunaq Jain in their answer:

for j in range(0, p):
    for k in range(0, qs[j]):
        # if BV i is a child of j, we set I[i, j] = 1
        val = f_map(j, k, qs)
        if I[val, j] == 0:
            I[val, j] = 1

and here is the f_map function, where I have edited the names of the arguments to reflect the names used by the caller:

def f_map(j,k,qs):
    if k < 0 or k > qs[j]:
        print('ERROR in f_map')
        return None
    result = k
    for i in range(0, j):
        result += qs[i]
    return result

We start by noting that it's always 0 ≤ k < qs[j] , because of the definition of the loop on k , so that we can safely remove the sanity check and write

def f_map(j, k, qs):
    result = k
    for i in range(0, j):
        result += qs[i]
    return result

Now, this is the doc string of the sum builtin

Signature: sum(iterable, start=0, /)
Docstring:
Return the sum of a 'start' value (default: 0) plus an iterable of numbers

When the iterable is empty, return the start value.
This function is intended specifically for use with numeric values and may reject non-numeric types.
Type: builtin_function_or_method

It is apparent that we can write

def f_map(j,k,qs):
    return sum(qs[:j], k)

and it is also apparent that we can do without the function call entirely:

for j in range(0, p):
    for k in range(0, qs[j]):
        # if BV i is a child of j, we set I[i, j] = 1
        val = sum(qs[:j], k)
        if I[val, j] == 0:
            I[val, j] = 1

Calling a built-in should be more efficient than a function call and a loop, shouldn't it?


Addressing Mad Physicist's remark

We can precompute the partial sums of qs to get a further speed-up

sqs = [sum(qs[:i]) for i in range(len(qs))] # there are faster ways...
...
for j in range(0, p):
    for k in range(0, qs[j]):
        # if BV i is a child of j, we set I[i, j] = 1
        val = k + sqs[j]
        if I[val, j] == 0:
            I[val, j] = 1
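One of those faster ways is `itertools.accumulate`, which builds the partial sums in a single pass rather than re-summing a slice per index (the `initial` keyword requires Python 3.8+):

```python
from itertools import accumulate

qs = [3, 1, 2]
# exclusive prefix sums: sqs[j] == sum(qs[:j])
sqs = list(accumulate(qs, initial=0))[:-1]  # [0, 3, 4]
```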
