
Why don't numpy arrays remember that they've been sorted?

This probably isn't a question specific to numpy, but it occurred to me while I was trying to optimize a piece of code that uses numpy arrays, and I think it makes a good example case.

My question is: why don't numpy arrays "remember" whether they have been sorted? Wouldn't this be an obvious opportunity to improve performance when evaluating conditions expressed with the standard relational operators?

To illustrate, instantiate an explicitly unsorted array.

import numpy as np
x = np.arange(30000)

# unsorted array
y = np.random.choice(x, x.size, replace=False)

Then test a simple > conditional...

%timeit y > 20
# 15.1 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit y > 25000
# 14.8 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This takes about the same time for any value, as you would expect (the condition has to be checked against every value in the array).

However, if we explicitly sort the array and then run the same test...

y.sort()
%timeit y > 20
# 14.8 µs ± 737 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit y > 25000
# 14.8 µs ± 515 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The results are more or less the same, which suggests the condition is still being checked against every value in the array.

It seems to me that if numpy arrays had a boolean attribute indicating whether the array has been sorted, there would be an opportunity for performance gains by running something like this:

def sorted_greater_than(arr, z):
    # Assumes arr is sorted ascending: once one value exceeds z,
    # every later value does too, so we can stop scanning early.
    n = len(arr)
    for i, v in enumerate(arr):
        if v > z:
            return np.array([False] * i + [True] * (n - i))
    return np.full(n, False)

That is, in a sorted array every value after index i is at least as large as the value at i, so if the value at i is greater than z, then every value after i is also greater than z (and similarly for the other operators).
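For what it's worth, a sorted array already makes this shortcut available without the Python-level scan above: np.searchsorted finds the cut point by binary search in O(log n). A minimal sketch of the idea (the helper name sorted_greater_than_fast is mine, not a numpy API):

import numpy as np

def sorted_greater_than_fast(arr, z):
    # Assumes arr is sorted ascending. Binary-search for the first
    # index whose value is strictly greater than z, then fill the
    # mask with a single vectorized write instead of an O(n) scan.
    cut = np.searchsorted(arr, z, side='right')
    mask = np.zeros(arr.size, dtype=bool)
    mask[cut:] = True
    return mask

y = np.arange(30000)  # already sorted
assert (sorted_greater_than_fast(y, 25000) == (y > 25000)).all()

This is exactly the kind of gain the hypothetical "is sorted" flag would unlock; the catch, as the answer below explains, is knowing when the flag is still valid.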

I am of course not suggesting that numpy is poorly optimized; I am just wondering what I'm missing here. Is there something logically inconsistent about the notion of an object "remembering" whether it has been sorted?

While it might improve performance when you need to know whether an array is sorted, it would slow down every manipulation you do on the array, such as appending or inserting elements or replacing subsets, because each write would have to update or invalidate the flag. The tradeoff would be really bad for the most frequent use cases, which have nothing to do with sorted state.

At best, an array could remember that it hasn't been modified since the last sort, but even that would add extra state that would almost never be needed.

If you need an array that retains its sorted state, you could create a class of your own and manage that state within it. (It may be a good exercise to see how much overhead is involved.)
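A minimal sketch of such a wrapper, assuming you only need the flag and one comparison operator (the class name and methods are invented for illustration, not an existing library):

import numpy as np

class SortedAwareArray:
    """Hypothetical wrapper that tracks whether its data is sorted."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self._is_sorted = False  # unknown until we sort

    def sort(self):
        self._data.sort()
        self._is_sorted = True

    def __setitem__(self, key, value):
        # Every write invalidates the flag -- this is the
        # per-operation overhead described above.
        self._data[key] = value
        self._is_sorted = False

    def greater_than(self, z):
        if self._is_sorted:
            # O(log n) binary search instead of an O(n) comparison.
            cut = np.searchsorted(self._data, z, side='right')
            mask = np.zeros(self._data.size, dtype=bool)
            mask[cut:] = True
            return mask
        return self._data > z

a = SortedAwareArray(np.random.choice(np.arange(30000), 30000, replace=False))
a.sort()
m1 = a.greater_than(25000)   # fast path via binary search
a[0] = 99999                 # invalidates the flag
m2 = a.greater_than(25000)   # falls back to the full comparison

Note that every mutation now pays for maintaining the flag, which is the tradeoff the answer describes.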
