Get non-overlapping distinct intervals from a set of intervals Python
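Given a set of intervals, I want to find the distinct non-overlapping intervals within it.
For example:
Input: [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50],[30,55], [60,70]]
Output: [[1,5], [21,22], [23,24], [25,30], [50,55], [60,70]]
How can I do this?
What I have tried so far: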
gene_bounds_list = [[1,10],[5,20], [6,21],[17,25],[22,23], [24,50],[30,55],[60,70]]
overlap_list = []
nonoverlap_list = []
nonoverlap_list.append(gene_bounds_list[0])
for i in range(1, len(gene_bounds_list)):
    curr_gene_bounds = gene_bounds_list[i]
    prev_gene_bounds = nonoverlap_list[-1]
    if curr_gene_bounds[0]<prev_gene_bounds[0]:
        if curr_gene_bounds[1]<prev_gene_bounds[0]: #case1
            continue
        if curr_gene_bounds[1] < prev_gene_bounds[1]: #case2
            nonoverlap_list[-1][0] = curr_gene_bounds[1]
        if curr_gene_bounds[1]>prev_gene_bounds[1]:
            # previous gene was completely overlapping within current gene,
            # so replace previous gene by current (bigger) gene and put previous gene into overlap list
            overlap_list.append(nonoverlap_list[-1])
            new_bound = [gene_bounds_list[i][0], gene_bounds_list[i][1]]
            nonoverlap_list.pop()
            nonoverlap_list.append([new_bound[0], new_bound[1]])
    elif curr_gene_bounds[0] > prev_gene_bounds[0] and curr_gene_bounds[1] < prev_gene_bounds[1]:
        # completely within another gene
        overlap_list.append([curr_gene_bounds[0], curr_gene_bounds[1]])
    elif curr_gene_bounds[0] < prev_gene_bounds[1]:
        # partially overlapping with another gene
        new_bound = [nonoverlap_list[-1][1], curr_gene_bounds[1]]
        nonoverlap_list[-1][1] = curr_gene_bounds[0]
        nonoverlap_list.append([new_bound[0], new_bound[1]])
    else:
        # not overlapping with another gene
        nonoverlap_list.append([gene_bounds_list[i][0], gene_bounds_list[i][1]])
unique_data = [list(x) for x in set(tuple(x) for x in gene_bounds_list)]
within_overlapping_intervals = []
for small in overlap_list:
    for master in unique_data:
        if (small[0]==master[0] and small[1]==master[1]):
            continue
        if (small[0]>master[0] and small[1]<master[1]):
            if(small not in within_overlapping_intervals):
                within_overlapping_intervals.append([small[0], small[1]])
for o in within_overlapping_intervals:
    nonoverlap_list.append(o) # append the overlapping intervals
nonoverlap_list.sort(key=lambda tup: tup[0])
flat_data = sorted([x for sublist in nonoverlap_list for x in sublist])
new_gene_intervals = [flat_data[i:i + 2] for i in range(0, len(flat_data), 2)]
print(new_gene_intervals)
However, this gives me the output: [[1, 5], [6, 10], [17, 20], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
Any ideas on how to remove the unwanted intervals?
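Here is one approach. The idea is to keep track, at every point, of how many layers of intervals cover it. To do that, we add a layer when we enter an interval and remove one when we leave it.
We first build sorted lists of the starts and of the ends. To tell whether a value is a start or an end, we create tuples (start, 1) or (end, -1).
Then we merge the two lists, sorted by value, and iterate over the result (heapq.merge makes this easy). Each time the number of layers becomes 1, a non-overlapping interval starts. The next time the number changes, that interval ends.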
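To make the event ordering concrete, here is a minimal sketch of how the tagged tuples interleave. The two touching intervals in `sample` are hypothetical data chosen for illustration, not taken from the question, and `itertools.accumulate` is used only to display the running layer count.

```python
from heapq import merge
from itertools import accumulate

# Two intervals that touch at 5 (hypothetical sample data, not from the question).
sample = [[1, 5], [5, 10]]

starts = sorted((lo, 1) for lo, _ in sample)   # entering an interval: +1 layer
ends = sorted((hi, -1) for _, hi in sample)    # leaving an interval: -1 layer

# merge() lazily interleaves two already-sorted inputs. Tuples compare
# element by element, so at the same value (5, -1) comes out before (5, 1).
events = list(merge(starts, ends))

# Running layer count after each event.
layers = list(accumulate(delta for _, delta in events))

print(events)  # [(1, 1), (5, -1), (5, 1), (10, -1)]
print(layers)  # [1, 0, 1, 0]
```

Because the end at 5 is processed before the start at 5, the layer count drops to 0 and then rises to 1 again, so touching intervals are reported as two separate pieces.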
from heapq import merge
def non_overlapping(data):
    out = []
    starts = sorted([(i[0], 1) for i in data]) # start of interval adds a layer of overlap
    ends = sorted([(i[1], -1) for i in data]) # end removes one
    layers = 0
    current = []
    for value, event in merge(starts, ends): # sorted by value, then ends (-1) before starts (1)
        layers += event
        if layers == 1: # start of a new non-overlapping interval
            current.append(value)
        elif current: # we either got out of an interval, or started an overlap
            current.append(value)
            out.append(current)
            current = []
    return out
data = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]
non_overlapping(data)
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
Note that the expected answer you gave in your question is wrong (for example, it contains 45, which is not part of the input data).
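One way to cross-check a result like this is a brute-force count. The sketch below is not part of the answer above: `covered_exactly_once` is a hypothetical helper name, and it assumes every interval has start < end. It relies on the fact that coverage can only change at an input endpoint, which is also why a value such as 45 cannot appear in a correct output for this data.

```python
def covered_exactly_once(intervals):
    """Brute-force check: keep the stretches covered by exactly one interval.

    Assumes start < end for every interval. Runs in O(n^2), so it is meant
    for validating small inputs, not for large data sets.
    """
    # Coverage is constant between two consecutive distinct endpoints.
    points = sorted({p for interval in intervals for p in interval})
    result = []
    for lo, hi in zip(points, points[1:]):
        # Count the intervals that contain the whole elementary segment [lo, hi].
        count = sum(1 for a, b in intervals if a <= lo and hi <= b)
        if count == 1:
            result.append([lo, hi])
    return result


data = [[1, 10], [5, 20], [6, 21], [17, 25], [22, 23], [24, 50], [30, 55], [60, 70]]
print(covered_exactly_once(data))
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
```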
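Plot the intervals on a timeline. Since the range is only 10**5, we can afford the memory for one slot per coordinate. Both the plotting and the scan can be done in linear time.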
intervals = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]
max_ = max(intervals, key=lambda x: x[1])[1]
timeline = [0] * (max_ + 1)
# mark plots
for start, end in intervals:
    timeline[start] += 1
    timeline[end] -= 1
# make the timeline
for i in range(1, len(timeline)):
    timeline[i] += timeline[i - 1]
# scan
result = []
for i, item in enumerate(timeline):
    if i == 0:
        continue
    prev = timeline[i - 1]
    if item == 1 and prev != 1:
        result.append([i, i + 1])
    elif item == 1 and prev == 1:
        result[-1][1] = i + 1
print(result)
Edit: now that the range has been updated to ~10^8, this will not work.
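For a large coordinate range, one possible adaptation is to store the +1/-1 marks only at the interval endpoints instead of in a full-length array. The following is a sketch under that assumption, not the answerer's code; `singly_covered` is a hypothetical name. Memory then grows with the number of intervals rather than with the largest coordinate, at the cost of an O(n log n) sort.

```python
from collections import defaultdict


def singly_covered(intervals):
    """Sparse variant of the timeline: keep deltas only at interval endpoints."""
    delta = defaultdict(int)
    for start, end in intervals:
        delta[start] += 1   # coverage rises where an interval starts
        delta[end] -= 1     # and falls where it ends

    result = []
    coverage = 0
    open_at = None          # where the current singly-covered stretch began
    for point in sorted(delta):
        new_coverage = coverage + delta[point]
        if coverage == 1 and new_coverage != 1:
            result.append([open_at, point])   # stretch ends here
        if new_coverage == 1 and coverage != 1:
            open_at = point                   # stretch begins here
        coverage = new_coverage
    return result


intervals = [[1, 10], [5, 20], [6, 21], [17, 25], [22, 23], [24, 50], [30, 55], [60, 70]]
print(singly_covered(intervals))
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
```

Like the dense timeline, this sketch sees a net change of zero where one interval ends exactly where another begins, so two touching intervals such as [1, 5] and [5, 10] come out as a single stretch [1, 10].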