
Get non-overlapping distinct intervals from a set of intervals in Python

Given a set of intervals, I would like to find the distinct, non-overlapping intervals, i.e. the regions that are covered by exactly one interval in the set.

For example:

Input: [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50], [30,55], [60,70]]

Output: [[1,5], [21,22], [23,24], [25,30], [50,55], [60,70]]

How can I do that?

What I have tried so far:

gene_bounds_list = [[1,10],[5,20], [6,21],[17,25],[22,23], [24,50],[30,55],[60,70]]
overlap_list = []
nonoverlap_list = []
nonoverlap_list.append(gene_bounds_list[0])

for i in range(1, len(gene_bounds_list)):
    curr_gene_bounds = gene_bounds_list[i]
    prev_gene_bounds = nonoverlap_list[-1]

    if curr_gene_bounds[0]<prev_gene_bounds[0]:
        if curr_gene_bounds[1]<prev_gene_bounds[0]: #case1
            continue
        if curr_gene_bounds[1] < prev_gene_bounds[1]:  #case2
            nonoverlap_list[-1][0] = curr_gene_bounds[1]
        if curr_gene_bounds[1]>prev_gene_bounds[1]:
            # previous gene was completely overlapping within current gene,
            # so replace previous gene by current (bigger) gene and put previous gene into overlap list
            overlap_list.append(nonoverlap_list[-1])
            new_bound = [gene_bounds_list[i][0], gene_bounds_list[i][1]]
            nonoverlap_list.pop()
            nonoverlap_list.append([new_bound[0], new_bound[1]])

    elif curr_gene_bounds[0] > prev_gene_bounds[0] and curr_gene_bounds[1] < prev_gene_bounds[1]:
        # completely within another gene
        overlap_list.append([curr_gene_bounds[0], curr_gene_bounds[1]])

    elif curr_gene_bounds[0] < prev_gene_bounds[1]:
        # partially overlapping with another gene
        new_bound = [nonoverlap_list[-1][1], curr_gene_bounds[1]]
        nonoverlap_list[-1][1] = curr_gene_bounds[0]
        nonoverlap_list.append([new_bound[0], new_bound[1]])

    else:
        # not overlapping with another gene
        nonoverlap_list.append([gene_bounds_list[i][0], gene_bounds_list[i][1]])

unique_data = [list(x) for x in set(tuple(x) for x in gene_bounds_list)]
within_overlapping_intervals = []

for small in overlap_list:
    for master in unique_data:
        if (small[0]==master[0] and small[1]==master[1]):
            continue
        if (small[0]>master[0] and small[1]<master[1]):
            if(small not in within_overlapping_intervals):
                within_overlapping_intervals.append([small[0], small[1]])

for o in within_overlapping_intervals:
    nonoverlap_list.append(o)  # append the overlapping intervals

nonoverlap_list.sort(key=lambda tup: tup[0])
flat_data = sorted([x for sublist in nonoverlap_list for x in sublist])
new_gene_intervals = [flat_data[i:i + 2] for i in range(0, len(flat_data), 2)]
print(new_gene_intervals)

However, this gives me an output of: [[1, 5], [6, 10], [17, 20], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]

Any ideas on how I can remove the unwanted intervals?

Here is a way to do it. The idea is to keep track of the number of layers of intervals at any point. For this, we add one layer when entering an interval, and remove one when exiting.

We start by building sorted lists of starts and ends. In order to identify whether a value is a start or an end, we create tuples (start, 1) or (end, -1).

Then, we merge these two lists, sorting by value, and iterate over the resulting list (using heapq.merge makes this easy). Each time the number of layers changes to 1, we have the start of a non-overlapping interval. When it changes again, that interval ends.

from heapq import merge


def non_overlapping(data):
    out = []
    starts = sorted([(i[0], 1) for i in data])  # start of interval adds a layer of overlap
    ends = sorted([(i[1], -1) for i in data])   # end removes one
    layers = 0
    current = []
    for value, event in merge(starts, ends):    # sorted by value, then ends (-1) before starts (1)
        layers += event
        if layers == 1:  # start of a new non-overlapping interval
            current.append(value)
        elif current:  # we either got out of an interval, or started an overlap
            current.append(value)
            out.append(current)
            current = []
    return out


data = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]

non_overlapping(data)
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
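To make the layer counting described above easier to follow, here is a small sketch that records the layer count after every endpoint instead of building the result. The helper name `layer_trace` is made up for this illustration and is not part of the answer's code; it only reuses the same `heapq.merge` sweep.

```python
from heapq import merge


def layer_trace(data):
    """Return (value, event, layers) for every endpoint, in sweep order."""
    starts = sorted((start, 1) for start, _ in data)   # entering an interval
    ends = sorted((end, -1) for _, end in data)        # leaving an interval
    trace = []
    layers = 0
    for value, event in merge(starts, ends):
        layers += event
        trace.append((value, event, layers))
    return trace


data = [[1, 10], [5, 20], [6, 21], [17, 25], [22, 23], [24, 50], [30, 55], [60, 70]]

for value, event, layers in layer_trace(data):
    kind = "start" if event == 1 else "end"
    print(f"{value:>3} {kind:<5} layers={layers}")
```

Every stretch where the count sits at exactly 1 (from 1 to 5, 21 to 22, 23 to 24, and so on) is one of the intervals returned by non_overlapping.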

Note that the expected answer you put in your question is wrong (for example, it contains a 45 that isn't part of the input data).

Plot the intervals on a timeline. As the range is just 10**5, we can afford the memory for a full timeline array. Plotting and scanning can both be done in linear time.

intervals = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]

max_ = max(intervals, key=lambda x: x[1])[1]
timeline = [0] * (max_ + 1)

# mark plots
for start, end in intervals:
    timeline[start] += 1
    timeline[end] -= 1


# make the timeline
for i in range(1, len(timeline)):
    timeline[i] += timeline[i - 1]


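# scan: a cell with count 1 is covered by exactly one interval
result = []
for i, item in enumerate(timeline):
    prev = timeline[i - 1] if i > 0 else 0  # nothing is covered before index 0
    if item == 1 and prev != 1:
        result.append([i, i + 1])  # a new run starts at i
    elif item == 1 and prev == 1:
        result[-1][1] = i + 1      # extend the current run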


print(result)

EDIT: Since the range has been updated to ~10^8, this won't work: the timeline array would need too much memory.
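One possible way around the memory limit is to keep the same +1/-1 marking but store it only at the interval endpoints, for example in a dict, and then sweep over the sorted endpoints. The following is a minimal sketch of that idea, not the answer's original code; the function name `covered_exactly_once` is an assumption chosen for illustration.

```python
from collections import defaultdict


def covered_exactly_once(intervals):
    """Return maximal [start, end] runs covered by exactly one interval."""
    # Sparse difference map: only endpoints get an entry.
    delta = defaultdict(int)
    for start, end in intervals:
        delta[start] += 1
        delta[end] -= 1

    result = []
    layers = 0   # number of intervals covering the segment [prev, point]
    prev = None
    for point in sorted(delta):
        if layers == 1:
            if result and result[-1][1] == prev:
                result[-1][1] = point          # contiguous: extend last run
            else:
                result.append([prev, point])   # start a new run
        layers += delta[point]
        prev = point
    return result


intervals = [[1, 10], [5, 20], [6, 21], [17, 25], [22, 23], [24, 50], [30, 55], [60, 70]]
print(covered_exactly_once(intervals))
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
```

The cost is dominated by sorting the endpoints, so it depends on the number of intervals rather than on how large the coordinates are. Like the timeline version, this sketch merges runs that touch (for example [1,5] and [5,10] come out as [1,10]).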

