Join two pandas dataframes with overlapping dates and add new rows with overlaps
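I am trying to solve a problem using two dataframes: 1 - a grid with TV data, which has the start and end (time) of each program and the channel name; 2 - audience data, which contains the start and end (time) of each tune-in, the channel tuned to, and the user ID.
How can I join the two tables and add new rows where the dates overlap for different users? Something like the example below: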
Dataframe 1:
Channel | In_Hour | Out_Hour |
---|---|---|
Channel_1 | 8:00 | 22:00 |
Channel_2 | 22:00 | 22:01 |
Channel_3 | 22:01 | 22:40 |
Dataframe 2:
Channel | Program | Start | End |
---|---|---|---|
Channel_1 | a | 07:00 | 09:00 |
Channel_1 | b | 09:00 | 12:40 |
Channel_1 | c | 12:00 | 23:00 |
Channel_1 | d | 23:00 | 23:30 |
Channel_1 | e | 23:30 | 23:45 |
Channel_2 | f | 21:00 | 23:40 |
Channel_3 | g | 21:40 | 23:00 |
Target Dataframe:
Channel | Program | Start | End |
---|---|---|---|
Channel_1 | a | 08:00 | 09:00 |
Channel_1 | b | 09:00 | 12:00 |
Channel_1 | c | 12:00 | 22:00 |
Channel_2 | f | 22:00 | 22:01 |
Channel_3 | g | 22:01 | 22:40 |
Setup:
import pandas as pd

df1 = pd.DataFrame({
    'Channel': {0: 'Channel_1', 1: 'Channel_2', 2: 'Channel_3'},
    'In_Hour': {0: '8:00', 1: '22:00', 2: '22:01'},
    'Out_Hour': {0: '22:00', 1: '22:01', 2: '22:40'}
})
df1['In_Hour'] = pd.to_datetime(df1['In_Hour'])
df1['Out_Hour'] = pd.to_datetime(df1['Out_Hour'])

df2 = pd.DataFrame({
    'Channel': {0: 'Channel_1', 1: 'Channel_1', 2: 'Channel_1', 3: 'Channel_1',
                4: 'Channel_1', 5: 'Channel_2', 6: 'Channel_3'},
    'Program': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g'},
    'Start': {0: '07:00', 1: '09:00', 2: '12:00', 3: '23:00', 4: '23:30',
              5: '21:00', 6: '21:40'},
    'End': {0: '09:00', 1: '12:40', 2: '23:00', 3: '23:30', 4: '23:45',
            5: '23:40', 6: '23:00'}
})
df2['Start'] = pd.to_datetime(df2['Start'])
df2['End'] = pd.to_datetime(df2['End'])
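Try merging the frames together, using a mask to filter out the rows that do not meet the condition, and using apply + clip to make sure that each row falls within the start and end times given by In_Hour and Out_Hour.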
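The mask and the clip together amount to an interval-overlap test followed by a clamp. As a minimal sketch on single timestamps (the helpers `clamp_to_window` and `t` are hypothetical names introduced here for illustration, and the date `2021-01-01` is an arbitrary assumption, since only the time of day matters):

```python
import pandas as pd


def clamp_to_window(start, end, window_start, window_end):
    """Return the part of [start, end] inside the window, or None if they do not overlap."""
    # Same test as the mask: Start is before Out_Hour and End is after In_Hour
    if not (start < window_end and end > window_start):
        return None
    # Same effect as clip: raise Start to the lower bound, cut End at the upper bound
    return max(start, window_start), min(end, window_end)


def t(hhmm):
    # Hypothetical helper: build a Timestamp on an arbitrary fixed date
    return pd.Timestamp('2021-01-01 ' + hhmm)


in_hour, out_hour = t('08:00'), t('22:00')  # Channel_1 viewing window

# Program c (12:00 to 23:00) overlaps the window and is cut at 22:00
prog_c = clamp_to_window(t('12:00'), t('23:00'), in_hour, out_hour)

# Program d (23:00 to 23:30) starts after the window ends, so it is dropped
prog_d = clamp_to_window(t('23:00'), t('23:30'), in_hour, out_hour)
```

Here `prog_c` is the pair (12:00, 22:00) and `prog_d` is `None`, which matches what happens to programs c and d on Channel_1.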
# Merge Frames Together
df3 = df2.merge(df1, on='Channel')
# Start is before Out_Hour and End is after In_Hour
m1 = df3['Start'].lt(df3['Out_Hour']) & df3['End'].gt(df3['In_Hour'])
# Filter To Only Keep Rows that are within times
df3 = df3[m1].reset_index(drop=True)
df3 = df3[['Channel', 'Program']].join(
    # Apply row by row (axis=1)
    df3.apply(
        # Clip lower and upper bounds based on In_Hour and Out_Hour
        lambda r: r[['Start', 'End']].clip(
            lower=r['In_Hour'], upper=r['Out_Hour']
        ),
        axis=1
    )
)
# Fix Hour Formatting
df3['Start'] = df3['Start'].dt.strftime('%H:%M')
df3['End'] = df3['End'].dt.strftime('%H:%M')
df3:
     Channel Program  Start    End
0  Channel_1       a  08:00  09:00
1  Channel_1       b  09:00  12:40
2  Channel_1       c  12:00  22:00
3  Channel_2       f  22:00  22:01
4  Channel_3       g  22:01  22:40
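In this output program b still ends at 12:40, whereas the target frame lists 12:00. If the intent is for each program to end no later than the start of the next program on the same channel (an assumption, as the question does not state this rule), one possible follow-up step is sketched below. The frame `clipped` is a hypothetical name that reproduces the output above so that the block runs on its own, and the times are parsed with an explicit `%H:%M` format.

```python
import pandas as pd

# Reproduce the clipped result shown above (hypothetical name: clipped)
clipped = pd.DataFrame({
    'Channel': ['Channel_1', 'Channel_1', 'Channel_1', 'Channel_2', 'Channel_3'],
    'Program': ['a', 'b', 'c', 'f', 'g'],
    'Start': ['08:00', '09:00', '12:00', '22:00', '22:01'],
    'End': ['09:00', '12:40', '22:00', '22:01', '22:40'],
})
clipped['Start'] = pd.to_datetime(clipped['Start'], format='%H:%M')
clipped['End'] = pd.to_datetime(clipped['End'], format='%H:%M')

# Order programs within each channel so that "next program" is well defined
clipped = clipped.sort_values(['Channel', 'Start']).reset_index(drop=True)

# Start of the following program on the same channel (NaT for the last one)
next_start = clipped.groupby('Channel')['Start'].shift(-1)

# Keep End when there is no following program or when there is no overlap;
# otherwise cut End back to the start of the following program
keep = next_start.isna() | (clipped['End'] <= next_start)
clipped['End'] = clipped['End'].where(keep, next_start)

# Same hour formatting as in the code above
clipped['Start'] = clipped['Start'].dt.strftime('%H:%M')
clipped['End'] = clipped['End'].dt.strftime('%H:%M')
```

With this step only the row for program b changes: its End moves from 12:40 to 12:00, because program c starts at 12:00 on the same channel.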