
Masking pandas dataframe of dates using another dataframe of events

I have a dataframe as follows:

import pandas as pd

df = pd.DataFrame({"date": pd.date_range(start="2012-03-01", end="2012-03-05"),
                   "date+1": pd.date_range(start="2012-03-02", end="2012-03-06"),
                   "date+2": pd.date_range(start="2012-03-03", end="2012-03-07")})

I also have another dataframe representing events, each with a start date and an end date:

event = pd.DataFrame({"event": ["A", "B"],
                      "start": ["2012-03-02", "2012-03-04"],
                      "end": ["2012-03-03", "2012-03-06"]})
event["start"] = pd.to_datetime(event["start"])
event["end"] = pd.to_datetime(event["end"])

I want to create a mask dataframe that is True wherever a date in df falls between the start date and end date (inclusive) of any event in the event dataframe. The expected output is:

0,  1,  1
1,  1,  1
1,  1,  1
1,  1,  1
1,  1,  0

This expected output corresponds to the following df:

2012-03-01,  2012-03-02,  2012-03-03
2012-03-02,  2012-03-03,  2012-03-04
2012-03-03,  2012-03-04,  2012-03-05
2012-03-04,  2012-03-05,  2012-03-06
2012-03-05,  2012-03-06,  2012-03-07

As you can see, only 2012-03-01 and 2012-03-07 do not fall within any event in the event dataframe. Looping could be computationally expensive. Do you have any suggestions for minimizing the looping?

You can use a cartesian join, then check whether each date lies between the start and end of an interval, and aggregate:

# cartesian join
z = (df
    .stack().reset_index().assign(k=1)
    .merge(event.assign(k=1)))

# check if date between start and end
z['mask'] = z[0].between(z['start'], z['end'])

# aggregate
df_m = z.groupby(['level_0', 'level_1'])['mask'].max().unstack().astype(int)
df_m

Output:

level_1  date  date+1  date+2
level_0                      
0           0       1       1
1           1       1       1
2           1       1       1
3           1       1       1
4           1       1       0

P.S. Instead of the trick of assigning k=1 to both frames before merging, if you're on a newer version of pandas (1.2.0+), you can use merge(how='cross') directly.
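As a rough sketch of the `how='cross'` variant mentioned above (assuming pandas 1.2.0 or newer), the same pipeline could look like the following. The names `long`, `value`, `row` and `col` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2012-03-01", end="2012-03-05"),
    "date+1": pd.date_range(start="2012-03-02", end="2012-03-06"),
    "date+2": pd.date_range(start="2012-03-03", end="2012-03-07"),
})
event = pd.DataFrame({
    "event": ["A", "B"],
    "start": pd.to_datetime(["2012-03-02", "2012-03-04"]),
    "end": pd.to_datetime(["2012-03-03", "2012-03-06"]),
})

# Long format: one row per (row label, column label, date).
# "row", "col" and "value" are illustrative names.
long = df.stack().rename("value").rename_axis(["row", "col"]).reset_index()

# Pair every date with every event; no helper key column is needed.
z = long.merge(event, how="cross")

# True where the date lies within the event, bounds included.
z["mask"] = z["value"].between(z["start"], z["end"])

# A cell is masked if it falls inside at least one event.
df_m = z.groupby(["row", "col"])["mask"].max().unstack().astype(int)
print(df_m)
```

The intermediate frame has one row per (date, event) pair, so memory grows with the product of the two sizes; that is the usual trade-off of a cartesian join.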

Create an interval index from events:

intervals = pd.IntervalIndex.from_tuples([*zip(event.start, event.end)], 
                                         closed = 'both')

IntervalIndex([[2012-03-02, 2012-03-03], [2012-03-04, 2012-03-06]],
              closed='both',
              dtype='interval[datetime64[ns]]')
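To see what the interval index gives you before applying it to the whole frame, here is a small sketch that tests single timestamps with `IntervalIndex.contains`. The two probe dates are picked from the question's data for illustration.

```python
import pandas as pd

event = pd.DataFrame({
    "event": ["A", "B"],
    "start": pd.to_datetime(["2012-03-02", "2012-03-04"]),
    "end": pd.to_datetime(["2012-03-03", "2012-03-06"]),
})

intervals = pd.IntervalIndex.from_tuples(
    list(zip(event["start"], event["end"])), closed="both"
)

# contains() returns one boolean per interval for a scalar value.
hits = intervals.contains(pd.Timestamp("2012-03-05"))    # inside event B only
misses = intervals.contains(pd.Timestamp("2012-03-01"))  # before every event

print(hits, hits.any())
print(misses, misses.any())
```

Calling `.any()` on the result collapses the per-interval answers into the single True/False needed for each cell of the mask.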

Run applymap on df (from pandas 2.1 onward, applymap is deprecated in favour of DataFrame.map, which is called the same way):

# ts is a single Timestamp; the check is True if any interval contains it
df.applymap(lambda ts: intervals.contains(ts).any()).astype(int)
 
   date  date+1  date+2
0     0       1       1
1     1       1       1
2     1       1       1
3     1       1       1
4     1       1       0
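Note that applymap still calls the lambda once per cell. If that becomes a bottleneck, one possible alternative (a different technique from the two answers above, sketched here under the assumption that every column of df is a datetime column) is to compare all dates against all events at once with NumPy broadcasting:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2012-03-01", end="2012-03-05"),
    "date+1": pd.date_range(start="2012-03-02", end="2012-03-06"),
    "date+2": pd.date_range(start="2012-03-03", end="2012-03-07"),
})
event = pd.DataFrame({
    "event": ["A", "B"],
    "start": pd.to_datetime(["2012-03-02", "2012-03-04"]),
    "end": pd.to_datetime(["2012-03-03", "2012-03-06"]),
})

dates = df.to_numpy()               # shape (n_rows, n_cols), datetime64
start = event["start"].to_numpy()   # shape (n_events,)
end = event["end"].to_numpy()       # shape (n_events,)

# Add a trailing axis so each date is compared with every event:
# the result has shape (n_rows, n_cols, n_events).
inside = (dates[..., None] >= start) & (dates[..., None] <= end)

# Collapse the event axis: a cell is 1 if any event covers its date.
mask = pd.DataFrame(inside.any(axis=-1).astype(int),
                    index=df.index, columns=df.columns)
print(mask)
```

The intermediate boolean array holds rows × columns × events elements, so for very large inputs it may be worth processing the events in chunks.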

