
How to sort a list of lists and keep only the maximal 2nd element of each of the 1st elements by intervals?

This is a harder version of this question, but I couldn't solve it efficiently (preferably without the need to import libraries).

Let's say that I have some list:

lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]

And let's say that I have a list of intervals:

intervals = [0,3,5,8]

For each interval, I want to keep exactly one sublist: among the sublists whose 1st element falls in that interval, the one with the highest 2nd element. In this example, that means there will be only one sublist whose 1st element is between 0 and 3, one whose 1st element is between 3 and 5, and so on, so the result will be:

result:
>>> [[1,6],[3,5],[7,8]]

To be noted:

  • It is not very important whether the intervals are treated as {0 <= x < 3} or {0 < x <= 3}, as long as there are no duplicates.
  • If, for example, [1,6] and [2,6] fall in the same interval, it is better to keep the one with the lowest 1st element ( [1,6] ).

Here are three solutions, ordered by performance:

  1. Create two lists, one for the first and one for the second number of each element. This increases memory usage but is the fastest option.

  2. Use the key parameter of max to get the element with the highest second number. This avoids duplicating memory, but is about 30% slower. It could be a good middle ground.

  3. Use itertools.groupby with a key function that returns the interval of the first number in each element. It can be used for more robust applications, but it is less efficient because, for each element, it iterates over the intervals until it finds the matching one. It is almost 3x slower than the first option.


Option 1: create two lists

Separating the list into two lists for first/second number of each element.

# sort and separate lst
lst = sorted(lst)
first = [e[0] for e in lst]
second = [e[1] for e in lst]

# iterate upper limits of intervals and get max of each sublist
i = k = 0
keep = []
# skip elements below the first boundary
while lst[i][0] < intervals[0]:
    i += 1
for upper in intervals[1:]:
    k = sum(f < upper for f in first[i:])
    keep.append(i + second[i:i+k].index(max(second[i:i+k])))
    i += k

result = [lst[i] for i in keep]
print(result)

Output

[[1, 6], [3, 5], [7, 8]]

Option 2: use max(lst, key)

You can get the element with the maximum second number with max(lst, key=lambda x: x[1]). Here is the implementation for the intervals.

lst = sorted(lst)

i = k = 0
result = []
for upper in intervals:
    i += k
    # an earlier version summed a generator:
    # k = sum(e[0] < upper for e in lst[i:])
    # the while-loop below stops at the first element outside the interval
    # instead of scanning the rest of the list on every iteration,
    # which helps when `lst` is long and there are many intervals
    k = 0
    while i + k < len(lst) and lst[i+k][0] < upper:
        k += 1
    if upper == intervals[0]:
        continue  # elements below the first boundary belong to no interval
    result.append(max(lst[i:i+k], key=lambda x:x[1]))

print(result)

Output

[[1, 6], [3, 5], [7, 8]]
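
One caveat with Option 2 as written: max raises ValueError on an empty slice, so an interval that contains no elements would stop the loop. A minimal sketch of a guarded variant follows, assuming empty intervals should simply be skipped; the function name max_per_interval is hypothetical and not part of the original answer.

```python
def max_per_interval(lst, intervals):
    """Hypothetical helper: Option 2 logic, skipping intervals with no elements."""
    lst = sorted(lst)
    result = []
    i = 0
    # skip elements below the first boundary
    while i < len(lst) and lst[i][0] < intervals[0]:
        i += 1
    for upper in intervals[1:]:
        j = i
        # advance j to the first element whose first number is >= upper
        while j < len(lst) and lst[j][0] < upper:
            j += 1
        # default=None avoids ValueError when lst[i:j] is empty
        best = max(lst[i:j], key=lambda x: x[1], default=None)
        if best is not None:
            result.append(best)
        i = j
    return result

print(max_per_interval([[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]], [0,3,5,8]))
# [[1, 6], [3, 5], [7, 8]]
print(max_per_interval([[1,2],[7,8]], [0,3,5,8]))
# [[1, 2], [7, 8]]
```

Because the list is sorted and max returns the first maximal element it meets, ties on the second number are resolved in favour of the lowest first number.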

Option 3: itertools.groupby(lst, key)

from itertools import groupby

def get_bin(element, bins):
    x = element[0]
    if x < bins[0]:
        return -1
    elif x in bins:
        return bins.index(x)
    else:
        for i, b in enumerate(bins[1:]):
            if x < b:
                break
        return i
        

# groupby only merges consecutive items with the same key, so sort lst first
result = sorted([
    max(items, key=lambda x: x[1])
    for _, items in groupby(sorted(lst), lambda x: get_bin(x, intervals))
])

Output

[[1, 6], [3, 5], [7, 8]]
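
Note that itertools.groupby only groups consecutive elements with equal keys, so the input should be ordered by the grouping key before it is passed in. A small sketch of the difference, using a hypothetical key function bin_of (integer division into bins of width 3) purely for illustration:

```python
from itertools import groupby

def bin_of(x):
    # hypothetical key for illustration: bins of width 3
    return x // 3

data = [1, 7, 2]

# without sorting, the two elements of bin 0 end up in separate groups
unsorted_groups = [(k, list(g)) for k, g in groupby(data, key=bin_of)]
# after sorting, each bin appears exactly once
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(data), key=bin_of)]

print(unsorted_groups)  # [(0, [1]), (2, [7]), (0, [2])]
print(sorted_groups)    # [(0, [1, 2]), (2, [7])]
```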

For simplicity:

lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]
intervals = [0,3,5,8]  # variable names usually start lowercase

Initial version (not an answer yet)

Here I'll demonstrate how to split the list into several groups using the values in intervals as indices, and then return the maximum item of each group. You can use a trick which I like to call a 'shift' of the array:

def get_groups(lst, intervals):
    return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]

This is a nice way to construct the slice bounds as tuples: (0, 3), (3, 5), (5, 8). Now you have:

>>> groups = get_groups(lst, intervals)
>>> groups
[[[1, 2], [1, 4], [1, 6]], 
 [[2, 6], [2, 3]], 
 [[3, 5], [7, 8]]]

And then you extract the maximum element of each group, comparing by the second column:

>>> [max(n, key = lambda x: x[1]) for n in groups]
[[1, 6], [2, 6], [7, 8]]

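If it's important to distinguish between two items that have the same value in the second column, add the first column to the key (as written, ties go to the item with the larger first element):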

[max(n, key = lambda x: (x[1], x[0])) for n in groups]
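
The question asks for the lowest 1st element to win when the 2nd elements are tied. A minimal sketch of both tie-breaking directions follows; the variable names are illustrative, and negating x[0] assumes the first elements are numbers.

```python
group = [[2, 6], [1, 6], [2, 3]]

# ties on the second column go to the larger first element
prefer_high = max(group, key=lambda x: (x[1], x[0]))
# negating the first column makes ties go to the smaller first element
prefer_low = max(group, key=lambda x: (x[1], -x[0]))

print(prefer_high)  # [2, 6]
print(prefer_low)   # [1, 6]
```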

Final version

The OP required, in contrast, splitting the list into several groups by the values that fall into the intervals. It's possible to build an algorithm on top of the first result if the list is presorted and we do a single pass over the array to find the indices where the interval bounds would be inserted to maintain order. In that case get_groups should be redefined as follows:

def get_groups(lst, intervals):
    lst = sorted(lst)
    first_column = [n[0] for n in lst]
    # positions in the sorted first column where each bound would be inserted
    intervals = searchsorted(first_column, intervals)
    return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]

For now, searchsorted can be an adapted version of RichieV's answer:

def searchsorted(array, intervals):
    idx, i, n = [], 0, len(array)
    for upper in intervals:
        while array[i] < upper:
            i += 1
            if i == n:
                idx.append(n)
                return idx
        else:
            idx.append(i)
    return idx

>>> searchsorted([1,1,1,2,2,3,7], [0,3,5,8])
[0, 5, 6, 7]

Note that get_groups is not quite optimal because both first_column and lst are iterated twice.

Usage:

def simple_sol(lst, intervals):
    return [max(n, key=lambda x: x[1]) for n in get_groups(lst, intervals)]
#Output: [[1, 6], [3, 5], [7, 8]]

Further optimisations

I've written a definition of searchsorted inspired by the alternative np.searchsorted, which is based on binary search instead and is therefore more efficient ( O(m log(n)) vs O(mn) ). For a pure-Python version, see also the docs and source code of bisect.bisect_left and the related answer about binary search. Using NumPy is a double win, C-level speed plus binary search (pretty much the same as my previous answer):

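import numpy as np

def binsorted(lst, intervals):
    lst = np.array(lst)
    lst = lst[np.argsort(lst[:,0])]  # sort rows by the first column
    idx = np.searchsorted(lst[:,0], intervals)
    # reduceat needs indices < len(lst), so drop a trailing index equal to len(lst)
    if idx[-1] == len(lst):
        return np.maximum.reduceat(lst, idx[:-1], axis=0)
    else:
        return np.maximum.reduceat(lst, idx, axis=0)

# Output: [[2, 6], [3, 5], [7, 8]]
# Note: the maximum is taken per column, so the first element of a result row
# (2 in [2, 6]) is the largest in its group, not necessarily the one paired with 6.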
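
For readers who want binary search without NumPy, here is a minimal sketch built on bisect.bisect_left from the standard library. The names searchsorted_bisect and groups_by_value are hypothetical, and skipping empty groups is an assumption rather than part of the original answer.

```python
from bisect import bisect_left

def searchsorted_bisect(array, intervals):
    # array must already be sorted; returns the insertion index of each bound
    return [bisect_left(array, upper) for upper in intervals]

def groups_by_value(lst, intervals):
    lst = sorted(lst)
    first_column = [n[0] for n in lst]
    idx = searchsorted_bisect(first_column, intervals)
    # pair consecutive indices to slice the sorted list into groups
    return [lst[i:j] for i, j in zip(idx[:-1], idx[1:])]

lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]
intervals = [0,3,5,8]

print(searchsorted_bisect([1,1,1,2,2,3,7], intervals))  # [0, 5, 6, 7]

# skip empty groups so max never sees an empty sequence
result = [max(g, key=lambda x: x[1]) for g in groups_by_value(lst, intervals) if g]
print(result)  # [[1, 6], [3, 5], [7, 8]]
```

Unlike the np.maximum.reduceat version, this returns actual rows of lst, since max with a key picks a whole sublist.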

Benchmarking

I compared option1, option2, option3, simple_sol and binsorted on the following samples:

lst = np.random.randint(1000, size = (1000000, 2)).tolist()
intervals = np.unique(np.random.randint(1000, size = 100)).tolist() + [1000]

and the timeit results were:

18.4 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.21 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.3 s ± 410 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.12 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.38 s ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

