This is a harder version of this question, but I couldn't solve it efficiently (preferably without importing any libraries).
Let's say that I have some list:
lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]
And let's say that I have a list of intervals:
intervals = [0,3,5,8]
For each interval I want to keep a single sublist, chosen among those whose 1st element falls in the interval: the one that has the highest 2nd element. In this example it means that there will be only one sublist whose 1st element is between 0 and 3, one whose 1st element is between 3 and 5, and so on, so the result will be:
result:
>>> [[1,6],[3,5],[7,8]]
To be noted:
Here are three solutions, ordered by performance:
Create two lists for the first/second number in each element. It increases memory usage but is the fastest option.
Use the key parameter in max to get the element with the highest second number. It avoids duplicating memory usage, but is about 30% slower. This could be a good middle ground.
Use itertools.groupby with a key function that gets the interval of the first number in each element. It can be used for more robust applications, but is not as efficient, because for each element it iterates over Intervals until it finds the matching interval. It is almost 3x slower than the first option.
Option 1: create two lists
Separating the list into two lists for first/second number of each element.
# sort and separate lst
lst = sorted(lst)
first = [e[0] for e in lst]
second = [e[1] for e in lst]
# iterate upper limits of intervals and get max of each sublist
i = k = 0
keep = []
while lst[i][0] < Intervals[0]:
    i += 1
for upper in Intervals[1:]:
    k = sum(f < upper for f in first[i:])
    keep.append(i + second[i:i+k].index(max(second[i:i+k])))
    i += k
result = [lst[i] for i in keep]
print(result)
Output
[[1, 6], [3, 5], [7, 8]]
Option 2: use max(lst, key)
You can get the element with the maximum second number with max(lst, key=lambda x: x[1]). Here is the implementation for the intervals.
lst = sorted(lst)
i = k = 0
result = []
for upper in Intervals:
    i += k
    # old solution summed a generator
    # k = sum(e[0] < upper for e in lst[i:])
    # this one uses a while-loop to avoid checking the rest of the list on each iteration
    # a good idea if `lst` is long and `Intervals` are many
    k = 0
    while i + k < len(lst) and lst[i+k][0] < upper:
        k += 1
    if upper == Intervals[0]:
        continue
    result.append(max(lst[i:i+k], key=lambda x:x[1]))
Output
[[1, 6], [3, 5], [7, 8]]
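One edge case worth noting: both options above call max on a slice that is empty whenever no element falls into an interval, and max raises ValueError on an empty sequence. The following is only a sketch of one way to guard against that, in the style of Option 2. The function name max_per_interval and the parameter names rows and bounds are made up for illustration, and it assumes half-open intervals [lower, upper), as the `< upper` comparisons above suggest.

```python
def max_per_interval(rows, bounds):
    """Sketch: for each half-open interval [bounds[k], bounds[k+1]),
    keep the row with the largest second value; skip empty intervals."""
    ordered = sorted(rows)
    n = len(ordered)
    result = []
    i = 0
    # skip rows whose first value lies below the first bound
    while i < n and ordered[i][0] < bounds[0]:
        i += 1
    for upper in bounds[1:]:
        j = i
        # advance j to the first row that belongs to a later interval
        while j < n and ordered[j][0] < upper:
            j += 1
        # default=None avoids ValueError when the slice is empty
        best = max(ordered[i:j], key=lambda r: r[1], default=None)
        if best is not None:
            result.append(best)
        i = j
    return result


lst = [[1, 2], [1, 4], [1, 6], [2, 6], [2, 3], [3, 5], [7, 8]]
print(max_per_interval(lst, [0, 3, 5, 6, 8]))  # interval [5, 6) is empty
# → [[1, 6], [3, 5], [7, 8]]
```

Whether an empty interval should be skipped or reported (for example as None) depends on what the caller needs; skipping is just the choice made in this sketch.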
Option 3: itertools.groupby(lst, key)
from itertools import groupby
def get_bin(element, bins):
    x = element[0]
    if x < bins[0]:
        return -1
    elif x in bins:
        return bins.index(x)
    else:
        for i, b in enumerate(bins[1:]):
            if x < b:
                break
        return i
result = sorted([
    max(items, key=lambda x: x[1])
    for _, items in groupby(lst, lambda x: get_bin(x, Intervals))
])
Output
[[1, 6], [3, 5], [7, 8]]
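Keep in mind that itertools.groupby only groups consecutive items with the same key, so Option 3 relies on lst already being ordered by its first number (which is true for the sample). A possible variant that sorts first and finds the interval with a binary search is sketched below. The names max_per_interval_groupby, rows and bounds are hypothetical, and half-open intervals [lower, upper) are assumed.

```python
from bisect import bisect_right
from itertools import groupby


def max_per_interval_groupby(rows, bounds):
    """Sketch: group rows by the interval of their first value,
    then keep the row with the largest second value in each group."""

    def bin_of(row):
        # index k such that bounds[k] <= row[0] < bounds[k + 1]
        return bisect_right(bounds, row[0]) - 1

    # groupby needs equal keys to be adjacent, so sort by first value
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    for b, items in groupby(ordered, key=bin_of):
        # drop rows below the first bound or at/above the last bound
        if 0 <= b < len(bounds) - 1:
            result.append(max(items, key=lambda r: r[1]))
    return result


unsorted_rows = [[7, 8], [1, 2], [3, 5], [1, 6], [2, 6]]
print(max_per_interval_groupby(unsorted_rows, [0, 3, 5, 8]))
# → [[1, 6], [3, 5], [7, 8]]
```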
For simplicity:
lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]
intervals = [0,3,5,8] # variable names usually start lowercase
First I'll demonstrate how to split the list into several groups by indices taken from intervals and then return the maximum item of each group. You can use a trick which I would like to call a 'shift' of the array:
def get_groups(lst, intervals):
    return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]
This is a nice way to construct the slice bounds as tuples: (0, 3), (3, 5), (5, 8). Now you have:
>>> groups = get_groups(lst, intervals)
>>> groups
[[[1, 2], [1, 4], [1, 6]],
[[2, 6], [2, 3]],
[[3, 5], [7, 8]]]
Then you extract the maximum element of each group, comparing by the second column:
>>> [max(n, key = lambda x: x[1]) for n in groups]
[[1, 6], [2, 6], [7, 8]]
If it's important to distinguish between two items that have the same value in the second column:
[max(n, key = lambda x: (x[1], x[0])) for n in groups]
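To make the difference between the two keys concrete, here is a small illustration on a made-up group (the list is not from the question). With key x[1] alone, max returns the first item that reaches the maximum; with the tuple key, ties on the second column are broken by the larger first column.

```python
group = [[1, 6], [2, 6], [2, 3]]

# ties on the second column: the first maximal item encountered wins
by_second = max(group, key=lambda x: x[1])

# ties on the second column: the larger first column wins
by_second_then_first = max(group, key=lambda x: (x[1], x[0]))

print(by_second)             # → [1, 6]
print(by_second_then_first)  # → [2, 6]
```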
The OP required, in contrast, splitting the list into several groups by values that fall into intervals. It's possible to build an algorithm on top of the first result if the list is presorted and we do a single search of the array to find the indices where the interval bounds should be inserted to maintain order. In that case get_groups should be redefined as follows:
def get_groups(lst, intervals):
    lst = sorted(lst)
    first_column = [n[0] for n in lst]
    intervals = searchsorted(first_column, intervals)
    return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]
For now you can use an adapted version of RichieV's answer:
def searchsorted(array, intervals):
    idx, i, n = [], 0, len(array)
    for upper in intervals:
        while array[i] < upper:
            i += 1
            if i == n:
                idx.append(n)
                return idx
        else:
            idx.append(i)
    return idx
>>> searchsorted([1,1,1,2,2,3,7], [0,3,5,8])
[0, 5, 6, 7]
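If the standard library is acceptable, the same indices can also be obtained with one binary search per bound. This is only a sketch: searchsorted_bisect is a made-up name, and it assumes array is already sorted in ascending order.

```python
from bisect import bisect_left


def searchsorted_bisect(array, bounds):
    """Sketch: for each bound, the leftmost index at which it could be
    inserted into the sorted `array` while keeping it sorted."""
    return [bisect_left(array, b) for b in bounds]


print(searchsorted_bisect([1, 1, 1, 2, 2, 3, 7], [0, 3, 5, 8]))
# → [0, 5, 6, 7]
```

Unlike the linear version, this one does not stop early once the end of the array is reached; bounds beyond the last element simply map to len(array).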
Note that get_groups is not quite optimal because both first_column and lst are iterated twice.
Usage:
def simple_sol(lst, intervals):
    return [max(n, key=lambda x: x[1]) for n in get_groups(lst, intervals)]
#Output: [[1, 6], [3, 5], [7, 8]]
I wrote the definition of searchsorted above as an alternative inspired by np.searchsorted, which is based on binary search instead. It's also more efficient (O(m log(n)) vs O(mn)). For a Python version, see also the docs and source code of bisect.bisect_left and the related answer about binary search. This is a double win, C-level + binary search (pretty much the same as my previous answer):
import numpy as np

def binsorted(lst, intervals):
    lst = np.array(lst)
    lst = lst[np.argsort(lst[:,0])] # sort lst by first column
    idx = np.searchsorted(lst[:,0], intervals)
    # np.maximum.reduceat takes the maximum of each column independently,
    # so a returned row is not necessarily a row of lst
    if idx[-1] == len(lst):
        return np.maximum.reduceat(lst, idx[:-1], axis=0)
    else:
        return np.maximum.reduceat(lst, idx, axis=0)
#Output: [[2, 6], [3, 5], [7, 8]]
I compared option1, option2, option3, simple_sol and binsorted for the samples:
lst = np.random.randint(1000, size = (1000000, 2)).tolist()
intervals = np.unique(np.random.randint(1000, size = 100)).tolist() + [1000]
and the timeit results were:
18.4 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.21 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.3 s ± 410 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.12 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.38 s ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
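Timings like these depend heavily on the machine and on the input, so it may be worth re-measuring locally. Below is a minimal, standard-library-only sketch of such a harness. The names brute_force and seconds_per_call are invented for illustration, the sample is much smaller than the one used above, and brute_force is only a simple baseline (half-open intervals assumed) to plug other candidates against.

```python
import random
import timeit


def brute_force(rows, bounds):
    """Baseline sketch: rescan all rows for every interval [lo, hi)."""
    result = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        group = [r for r in rows if lo <= r[0] < hi]
        if group:  # skip intervals that contain no row
            result.append(max(group, key=lambda r: r[1]))
    return result


def seconds_per_call(func, rows, bounds, repeat=3):
    """Average wall-clock seconds of func(rows, bounds) over `repeat` calls."""
    return timeit.timeit(lambda: func(rows, bounds), number=repeat) / repeat


random.seed(0)  # fixed seed so the random sample is reproducible
rows = [[random.randrange(1000), random.randrange(1000)] for _ in range(10_000)]
bounds = sorted({random.randrange(1000) for _ in range(20)}) + [1000]

print(f"brute_force: {seconds_per_call(brute_force, rows, bounds):.4f} s per call")
```

Any of the functions discussed above can be passed to seconds_per_call in the same way, provided it takes the list and the interval bounds as its two arguments.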