
Merge two series of time intervals in pandas (intersection)

I have multiple lists of time intervals and I need to find the time intervals (the intersection) that are common to all of them.

For example:

a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'], # each line is read as [start, end]
     ['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
     ['2018-02-05 11:30:30', '2018-02-05 11:42:32']]

b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
     ['2018-02-04 10:30:30', '2018-02-05 10:32:30']]

c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]

The result would be:

common_intv = [['2018-02-03 15:16:30','2018-02-03 17:06:30'],
               ['2018-02-05 10:30:30','2018-02-05 10:32:30']]

I've found this solution, which should also work for time intervals, but I was wondering whether there is a more efficient way to do it in pandas.

The proposed solution in the link would process two lists at a time, i.e. it would first find the common intervals between a and b, then put these common intervals inside a variable common, then find the common intervals between common and c, and so on.
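For reference, the pairwise approach described above can be sketched in plain Python as follows. This is only an illustration, not the code from the linked solution: the helper name intersect_two is hypothetical, and the sketch assumes that each list is sorted by start, that the intervals inside one list do not overlap each other, and that the timestamps are 'YYYY-MM-DD HH:MM:SS' strings (which compare chronologically as plain strings).

```python
from functools import reduce


def intersect_two(xs, ys):
    """Intersect two sorted lists of non-overlapping [start, end] intervals.

    Hypothetical helper; assumes same-format timestamp strings that sort
    chronologically.
    """
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        start = max(xs[i][0], ys[j][0])
        end = min(xs[i][1], ys[j][1])
        if start <= end:  # the two current intervals overlap
            out.append([start, end])
        # advance whichever interval finishes first
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'],
     ['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
     ['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
     ['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]

# fold the lists pairwise: (a and b) first, then the result with c
common = reduce(intersect_two, [a, b, c])
print(common)
```

On the sample data this reproduces the two intervals listed in common_intv.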

Of course, a global solution (considering all intervals at the same time) would be even better!
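One possible shape for such a global solution is a sweep over all interval endpoints at once, counting how many lists are "open" at each point in time. The sketch below is an assumption-laden illustration rather than a pandas method: intersect_all is a hypothetical name, and the counting only works if the intervals inside each individual list never overlap one another (otherwise a single list could push the counter up by itself).

```python
def intersect_all(*interval_lists):
    """Return the intervals covered by every list (hypothetical helper).

    Assumes the intervals within each list are mutually non-overlapping and
    that timestamps are same-format strings that sort chronologically.
    """
    n = len(interval_lists)
    events = []
    for intervals in interval_lists:
        for start, end in intervals:
            events.append((start, 0))  # 0 = open; sorts before a close at the same time
            events.append((end, 1))    # 1 = close
    events.sort()

    out, active, current_start = [], 0, None
    for time, kind in events:
        if kind == 0:
            active += 1
            if active == n:        # all lists are now open: a common interval begins
                current_start = time
        else:
            if active == n:        # one list closes: the common interval ends here
                out.append([current_start, time])
            active -= 1
    return out


a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'],
     ['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
     ['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
     ['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]

common_intv = intersect_all(a, b, c)
print(common_intv)
```

The whole computation is a single sort plus one pass over the events, regardless of how many lists are passed in.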

You can use pandas.merge_asof in both directions to get a first selection and then carefully clean up the resulting rows. The code could be:

import numpy as np
import pandas as pd

# build the dataframes and ensure Timestamp types
dfa = pd.DataFrame(a, columns=['start', 'end']).astype('datetime64[ns]')
dfb = pd.DataFrame(b, columns=['start', 'end']).astype('datetime64[ns]')
dfc = pd.DataFrame(c, columns=['start', 'end']).astype('datetime64[ns]')

# merge a and b
tmp = pd.concat([pd.merge_asof(dfa, dfb, on='start'),
                 pd.merge_asof(dfb, dfa, on='start')]
                ).sort_values('start').dropna()

# keep the smaller of the two ends and drop rows where start > end (no overlap)
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]

# merge the result with c in the same way
tmp = pd.concat([pd.merge_asof(tmp, dfc, on='start'),
                 pd.merge_asof(dfc, tmp, on='start')]
                ).sort_values('start').dropna()

tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]

It gives the expected result:

                start                 end
0 2018-02-03 15:16:30 2018-02-03 17:06:30
1 2018-02-05 10:30:30 2018-02-05 10:32:30
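Because the same merge-and-clean step is repeated for every additional list, it can be wrapped in a function and folded over any number of frames. The following is a minimal sketch of that idea, not part of the original answer: intersect_frames is a hypothetical helper name, and it assumes that the intervals inside each frame do not overlap one another. The extra drop_duplicates call covers the case where an interval in each frame starts at the same instant, which would otherwise produce the same row from both merge directions.

```python
from functools import reduce

import numpy as np
import pandas as pd


def intersect_frames(left, right):
    """Intersect two frames of [start, end] intervals (hypothetical helper)."""
    # merge_asof requires both sides to be sorted on the key
    left = left.sort_values('start')
    right = right.sort_values('start')
    merged = pd.concat([pd.merge_asof(left, right, on='start'),
                        pd.merge_asof(right, left, on='start')],
                       ignore_index=True).sort_values('start').dropna()
    # keep the smaller end, then drop rows with no real overlap
    merged = merged.assign(end=np.minimum(merged.end_x, merged.end_y))[['start', 'end']]
    merged = merged[merged['start'] <= merged['end']]
    return merged.drop_duplicates().reset_index(drop=True)


a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'],
     ['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
     ['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
     ['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]

frames = [pd.DataFrame(x, columns=['start', 'end']).astype('datetime64[ns]')
          for x in (a, b, c)]
common = reduce(intersect_frames, frames)
print(common)
```

With the sample lists this should give the same two rows as shown above.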
