Intersection of two lists of ranges in Python

A friend of mine passed along an interview question he recently got, and I wasn't very happy with my approach to the solution. The question is as follows:

  • You have two lists.
  • Each list contains lists of length 2, each representing a range (i.e. [3,5] means a range from 3 to 5, inclusive).
  • You need to return the intersection of all ranges between the sets. If I give you [1,5] and [0,2], the result would be [1,2].
  • Within each list, the ranges will always increase and never overlap (i.e. it will be [[0, 2], [5, 10] ... ], never [[0,2], [2,5] ... ]).

In general there are no "gotchas" in terms of the ordering or overlapping of the lists.

Example:

a = [[0, 2], [5, 10], [13, 23], [24, 25]]
b = [[1, 5], [8, 12], [15, 18], [20, 24]]

Expected output: [[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]

My lazy solution involved spreading the list of ranges into a list of integers, then doing a set intersection, like this:

def get_intersection(x, y):
    x_spread = [item for sublist in [list(range(l[0],l[1]+1)) for l in x] for item in sublist]
    y_spread = [item for sublist in [list(range(l[0],l[1]+1)) for l in y] for item in sublist]
    flat_intersect_list = list(set(x_spread).intersection(y_spread))
...

But I imagine there's a solution that's both readable and more efficient.

Please explain how you would mentally tackle this problem, if you don't mind. A time/space complexity analysis would also be helpful.

Thanks

[[max(first[0], second[0]), min(first[1], second[1])] 
  for first in a for second in b 
  if max(first[0], second[0]) <= min(first[1], second[1])]

A list comprehension which gives the answer: [[1, 2], [5, 5], [8, 10], [15, 18], [20, 23], [24, 24]]

Breaking it down:

[[max(first[0], second[0]), min(first[1], second[1])] 

Take the maximum of the two start values and the minimum of the two end values.

for first in a for second in b 

For all combinations of a range from a and a range from b:

if max(first[0], second[0]) <= min(first[1], second[1])]

Only if the larger start does not exceed the smaller end, i.e. the two ranges actually overlap.


If you need the output compacted, the following function does that (in O(n^2) time, because deletion from a list is O(n), a step we perform O(n) times):

def reverse_compact(lst):
    for index in range(len(lst) - 2,-1,-1):
        if lst[index][1] + 1 >= lst[index + 1][0]:
            lst[index][1] = lst[index + 1][1]
            del lst[index + 1]  # remove compacted entry O(n)*
    return lst

It joins ranges which touch, given they are in order. It works in reverse because that lets us do the operation in place and delete the compacted entries as we go. If we went forward, deleting entries would muck with our index.

>>> reverse_compact(comp)
[[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]
  • The compacting function can be reduced further to O(n) by doing a forward in-place compaction and copying back the elements, as then each inner step is O(1) (get/set instead of del), but this is less readable:

This runs in O(n) time and space complexity:

def compact(lst):
    next_index = 0  # Keeps track of the last used index in our result
    for index in range(len(lst) - 1):
        if lst[next_index][1] + 1 >= lst[index + 1][0]:
            lst[next_index][1] = lst[index + 1][1]
        else:    
            next_index += 1
            lst[next_index] = lst[index + 1]
    return lst[:next_index + 1]

Using either compactor, the list comprehension is the dominating term here, with time = O(n*m) and space = O(m+n), as it compares all possible combinations of the two lists with no early outs. This does not take advantage of the ordered structure of the lists given in the prompt: you could exploit that structure to reduce the time complexity to O(n + m), as the ranges always increase and never overlap, meaning you can do all comparisons in a single pass.
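As a rough illustration of that single-pass idea, here is a minimal sketch. The function name `intersect_sorted` is my own, and it assumes both inputs are sorted, non-overlapping lists of closed integer ranges as in the prompt. It also folds the compaction step into the same pass by extending the previous result when the new piece is adjacent to it.

```python
def intersect_sorted(a, b):
    """Single-pass intersection of two sorted, non-overlapping
    lists of closed integer ranges (hypothetical helper)."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            if result and result[-1][1] + 1 >= lo:
                # Adjacent to the previous piece: extend it instead of appending.
                result[-1][1] = hi
            else:
                result.append([lo, hi])
        # Advance whichever range ends first; the other may still overlap more.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


a = [[0, 2], [5, 10], [13, 23], [24, 25]]
b = [[1, 5], [8, 12], [15, 18], [20, 24]]
print(intersect_sorted(a, b))
# [[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]
```

Each loop iteration advances one of the two indices, so the loop body runs at most n + m times.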


Note there is more than one solution and hopefully you can solve the problem and then iteratively improve upon it.

A 100% correct answer which satisfies all possible inputs is not the goal of an interview question. It is to see how a person thinks and handles challenges, and whether they can reason about a solution.

In fact, if you give me a 100% correct, textbook answer, it's probably because you've seen the question before and you already know the solution... and therefore that question isn't helpful to me as an interviewer. 'Check, can regurgitate solutions found on StackOverflow.' The idea is to watch you solve a problem, not regurgitate a solution.

Too many candidates miss the forest for the trees: acknowledging shortcomings and suggesting solutions is the right way to go about answering an interview question. You don't have to have a solution; you have to show how you would approach the problem.

Your solution is fine if you can explain it and detail potential issues with using it.

I got my current job by failing to answer an interview question: after spending the majority of my time trying, I explained why my approach didn't work and described the second approach I would try given more time, along with potential pitfalls I saw in that approach (and why I opted for my first strategy initially).

OP, I believe this solution works, and it runs in O(m+n) time where m and n are the lengths of the lists. (To be sure, make ranges a linked list so that changing its length runs in constant time.)

def intersections(a,b):
    ranges = []
    i = j = 0
    while i < len(a) and j < len(b):
        a_left, a_right = a[i]
        b_left, b_right = b[j]

        if a_right < b_right:
            i += 1
        else:
            j += 1

        if a_right >= b_left and b_right >= a_left:
            end_pts = sorted([a_left, a_right, b_left, b_right])
            middle = [end_pts[1], end_pts[2]]
            ranges.append(middle)

    # Merge neighbouring results that touch or are adjacent, e.g. [20, 23] and [24, 24].
    ri = 0
    while ri < len(ranges)-1:
        if ranges[ri][1] + 1 >= ranges[ri+1][0]:
            ranges[ri:ri+2] = [[ranges[ri][0], ranges[ri+1][1]]]
        else:
            ri += 1

    return ranges

a = [[0,2], [5,10], [13,23], [24,25]]
b = [[1,5], [8,12], [15,18], [20,24]]
print(intersections(a,b))
# [[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]

Algorithm

Given two intervals, if they overlap, then the intersection's starting point is the maximum of the starting points of the two intervals, and its stopping point is the minimum of the stopping points:

[Figure: diagram of intersecting intervals]
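A minimal sketch of that rule for a single pair of closed intervals, written as (start, stop) tuples. The name `intersect_pair` and the choice to return None for disjoint intervals are my own, for illustration only.

```python
def intersect_pair(x, y):
    """Intersection of two closed intervals given as (start, stop) pairs,
    or None when they do not overlap (hypothetical helper)."""
    start = max(x[0], y[0])  # the later of the two starting points
    stop = min(x[1], y[1])   # the earlier of the two stopping points
    return (start, stop) if start <= stop else None


print(intersect_pair((1, 5), (0, 2)))   # (1, 2)
print(intersect_pair((0, 2), (5, 10)))  # None
```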

To find all the pairs of intervals that might intersect, start with the first pair and keep advancing the interval with the lower stopping point:

[Figure: animation of the algorithm]

At most m + n pairs of intervals are considered, where m is the length of the first list and n is the length of the second list. Calculating the intersection of a pair of intervals is done in constant time, so this algorithm's time complexity is O(m+n).

Implementation

To keep the code simple, I'm using Python's built-in range object for the intervals. This is a slight deviation from the problem description in that ranges are half-open intervals rather than closed. That is,

(x in range(a, b)) == (a <= x < b)

Given two range objects x and y, their intersection is range(start, stop), where start = max(x.start, y.start) and stop = min(x.stop, y.stop). If the two ranges don't overlap, then start >= stop and you just get an empty range:

>>> len(range(1, 0))
0

So given two lists of ranges, xs and ys, each increasing in start value, the intersection can be computed as follows:

def intersect_ranges(xs, ys):

    # Merge any abutting ranges (implementation below):
    xs, ys = merge_ranges(xs), merge_ranges(ys)

    # Try to get the first range in each iterator:
    try:
        x, y = next(xs), next(ys)
    except StopIteration:
        return

    while True:
        # Yield the intersection of the two ranges, if it's not empty:
        intersection = range(
            max(x.start, y.start),
            min(x.stop, y.stop)
        )
        if intersection:
            yield intersection

        # Try to increment the range with the earlier stopping value:
        try:
            if x.stop <= y.stop:
                x = next(xs)
            else:
                y = next(ys)
        except StopIteration:
            return

It seems from your example that the ranges can abut. So any abutting ranges have to be merged first:

def merge_ranges(xs):
    start, stop = None, None
    for x in xs:
        if stop is None:
            start, stop = x.start, x.stop
        elif stop < x.start:
            yield range(start, stop)
            start, stop = x.start, x.stop
        else:
            stop = x.stop
    if stop is not None:  # guard: an empty input yields nothing
        yield range(start, stop)

Applying this to your example:

>>> a = [[0, 2], [5, 10], [13, 23], [24, 25]]
>>> b = [[1, 5], [8, 12], [15, 18], [20, 24]]
>>> list(intersect_ranges(
...     (range(i, j+1) for (i, j) in a),
...     (range(i, j+1) for (i, j) in b)
... ))
[range(1, 3), range(5, 6), range(8, 11), range(15, 19), range(20, 25)]
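Since the question uses closed [start, end] pairs, the half-open range objects can be mapped back by subtracting one from each stop. This is a small sketch; `to_closed` is a name I made up for illustration.

```python
def to_closed(ranges):
    # range(start, stop) is half-open, so the inclusive upper end is stop - 1.
    return [[r.start, r.stop - 1] for r in ranges]


result = [range(1, 3), range(5, 6), range(8, 11), range(15, 19), range(20, 25)]
print(to_closed(result))
# [[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]
```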

I know this question already got a correct answer. For completeness, I would like to mention that some time ago I developed a Python library, portion (https://github.com/AlexandreDecan/portion), that supports this kind of operation (intersections between lists of atomic intervals).

You can have a look at the implementation; it's quite close to some of the answers that were provided here: https://github.com/AlexandreDecan/portion/blob/master/portion/interval.py#L406

To illustrate its usage, let's consider your example:

a = [[0, 2], [5, 10], [13, 23], [24, 25]]
b = [[1, 5], [8, 12], [15, 18], [20, 24]]

We need to convert these "items" to closed (atomic) intervals first:

import portion as P

a = [P.closed(x, y) for x, y in a]
b = [P.closed(x, y) for x, y in b]

print(a)

... displays [[0,2], [5,10], [13,23], [24,25]] (each [x,y] is an Interval object).

Then we can create an interval that represents the union of these atomic intervals:

a = P.Interval(*a)
b = P.Interval(*b)

print(a)

... displays [0,2] | [5,10] | [13,23] | [24,25] (a single Interval object, representing the union of all the atomic ones).

And now we can easily compute the intersection:

c = a & b
print(c)

... displays [1,2] | [5] | [8,10] | [15,18] | [20,23] | [24].

Notice that our answer differs from yours ([20,23] | [24] instead of [20,24]) since the library expects continuous domains for values. We can quite easily convert the results to discrete intervals following the approach proposed in https://github.com/AlexandreDecan/portion/issues/24#issuecomment-604456362 as follows:

def discretize(i, incr=1):
  first_step = lambda s: (P.OPEN, (s.lower - incr if s.left is P.CLOSED else s.lower), (s.upper + incr if s.right is P.CLOSED else s.upper), P.OPEN)
  second_step = lambda s: (P.CLOSED, (s.lower + incr if s.left is P.OPEN and s.lower != -P.inf else s.lower), (s.upper - incr if s.right is P.OPEN and s.upper != P.inf else s.upper), P.CLOSED)
  return i.apply(first_step).apply(second_step)

print(discretize(c))

... displays [1,2] | [5] | [8,10] | [15,18] | [20,24].

I'm no kind of Python programmer, but I don't think this problem is amenable to slick Python-esque short solutions that are also efficient.

Mine treats the interval boundaries as "events" labeled 1 and 2, processing them in order. Each event toggles the respective bit in a parity word. When we toggle to or from 3, it's time to emit the beginning or end of an intersection interval.

The tricky part is that e.g. [13, 23], [24, 25] is being treated as [13, 25]; adjacent intervals must be concatenated. The nested if below takes care of this case by continuing the current interval rather than starting a new one. Also, for equal event values, interval starts must be processed before ends so that e.g. [1, 5] and [5, 10] will be emitted as [5, 5] rather than nothing. That's handled with the middle field of the event tuples.

This implementation is O(n log n) due to the sorting, where n is the total length of both inputs. By merging the two event lists pairwise, it could be O(n), but this article suggests that the lists must be huge before the library merge will beat the library sort.

def get_isect(a, b):
  # Python 3: map() returns an iterator, so build the event lists with comprehensions.
  events = ([(x[0], 0, 1) for x in a] + [(x[1], 1, 1) for x in a]
          + [(x[0], 0, 2) for x in b] + [(x[1], 1, 2) for x in b])
  events.sort()
  prevParity = 0
  isect = []
  for event in events:
    parity = prevParity ^ event[2]
    if parity == 3:
      # Maybe start a new intersection interval.
      if len(isect) == 0 or isect[-1][1] < event[0] - 1:
        isect.append([event[0], 0])
    elif prevParity == 3:
      # End the current intersection interval.
      isect[-1][1] = event[0]
    prevParity = parity
  return isect

Here is an O(n) version that's a bit more complex because it finds the next event on the fly by merging the input lists. It also requires only constant storage beyond inputs and output:

def get_isect2(a, b):
  ia = ib = prevParity = 0
  isect = []
  while True:
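    # Use // for integer indexing (Python 3) and compare against None so that
    # a boundary value of 0 is not mistaken for an exhausted list.
    aVal = a[ia // 2][ia % 2] if ia < 2 * len(a) else None
    bVal = b[ib // 2][ib % 2] if ib < 2 * len(b) else None
    if aVal is None and bVal is None: break
    if bVal is None or (aVal is not None and (aVal < bVal or (aVal == bVal and ia % 2 == 0))):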
      parity = prevParity ^ 1
      val = aVal
      ia += 1
    else:
      parity = prevParity ^ 2
      val = bVal
      ib += 1
    if parity == 3:
      if len(isect) == 0 or isect[-1][1] < val - 1:
        isect.append([val, 0])
    elif prevParity == 3:
      isect[-1][1] = val
    prevParity = parity
  return isect

I'm answering your question the way I would personally answer an interview question, and the way I would most appreciate it being answered; the interviewee's goal is probably to demonstrate a range of skills, not limited strictly to Python. So this answer is admittedly going to be more abstract than others here.

It might be helpful to ask for information about any constraints I'm operating under. Operation time and space complexity are common constraints, as is development time, all of which are mentioned in previous answers here; but other constraints might also arise. As common as any of those is maintenance and integration with existing code.

Within each list, the ranges will always increase and never overlap

When I see this, it probably means there is some pre-existing code that normalizes the list of ranges by sorting them and merging overlaps. That's a pretty common union operation. When joining an existing team or ongoing project, one of the most important factors for success is integrating with existing patterns.

An intersection can also be performed via a union operation: invert the sorted ranges, union them, and invert the result.
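To make the invert/union/invert idea concrete, here is a rough sketch for closed integer ranges. All three function names are hypothetical, and the "universe" used for inversion is assumed to be the span from the smallest start to the largest end of the two inputs. It relies on the identity A ∩ B = not (not A ∪ not B).

```python
def complement(ranges, lo, hi):
    """Gaps left by sorted, disjoint closed integer ranges inside [lo, hi]."""
    gaps, cursor = [], lo
    for start, stop in ranges:
        if start > cursor:
            gaps.append([cursor, start - 1])
        cursor = max(cursor, stop + 1)
    if cursor <= hi:
        gaps.append([cursor, hi])
    return gaps


def union(x, y):
    """Sort both lists together and merge ranges that overlap or touch."""
    merged = []
    for start, stop in sorted(x + y):
        if merged and merged[-1][1] + 1 >= start:
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return merged


def intersect_via_union(a, b):
    if not a or not b:
        return []
    lo = min(a[0][0], b[0][0])
    hi = max(a[-1][1], b[-1][1])
    # De Morgan: A & B == not (not A | not B), all taken within [lo, hi].
    inverted = union(complement(a, lo, hi), complement(b, lo, hi))
    return complement(inverted, lo, hi)


a = [[0, 2], [5, 10], [13, 23], [24, 25]]
b = [[1, 5], [8, 12], [15, 18], [20, 24]]
print(intersect_via_union(a, b))
# [[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]
```

The sort inside `union` makes this O(n log n) overall; the appeal is reuse of an existing normalizing union, not raw speed.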

To me, that answer demonstrates experience with algorithms generally and "range" problems specifically, an appreciation that the most readable and maintainable approach is typically reusing existing code, and a desire to help a team succeed over simply puzzling on my own.

Another approach is to sort both lists together into one iterable list. Iterate the list, reference counting each start/end as increment/decrement steps. Ranges are emitted on transitions between reference counts of 1 and 2. This approach is inherently extensible to support more than two lists, if the sort operation meets our needs (and it usually does).
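A minimal sketch of the reference-counting sweep, generalized to any number of lists. The name `intersect_all` is my own, and it assumes each input list holds sorted, non-overlapping closed integer ranges. A range is opened when the count reaches the number of lists and closed when it drops below that.

```python
def intersect_all(*lists):
    """Sweep over all endpoints, counting how many lists cover the current value."""
    events = []
    for ranges in lists:
        for start, stop in ranges:
            events.append((start, 0))  # 0 = start; sorts before an end at the same value
            events.append((stop, 1))   # 1 = end
    events.sort()

    result, count = [], 0
    for value, kind in events:
        if kind == 0:
            count += 1
            # Every list now covers `value`: open a new range unless it is
            # adjacent to the previous one, in which case that one continues.
            if count == len(lists) and (not result or result[-1][1] + 1 < value):
                result.append([value, value])
        else:
            if count == len(lists):
                result[-1][1] = value  # close (or extend) the current range
            count -= 1
    return result


a = [[0, 2], [5, 10], [13, 23], [24, 25]]
b = [[1, 5], [8, 12], [15, 18], [20, 24]]
print(intersect_all(a, b))
# [[1, 2], [5, 5], [8, 10], [15, 18], [20, 24]]
print(intersect_all(a, b, [[0, 9], [21, 30]]))
# [[1, 2], [5, 5], [8, 9], [21, 24]]
```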

Unless instructed otherwise, I would offer the general approaches and discuss reasons I might use each before writing code.

So, there's no code here. But you did ask for general approaches and thinking :D
