
Join two pandas dataframes with overlapping dates and add new rows with overlaps

I am trying to solve a problem using two dataframes: 1 - A grid with TV data, which has the beginning and end time of each show and the channel name; 2 - Viewer data, which has the beginning and end time of each tune-in, the channel that was tuned to, and the user ID.

How can I join both tables and add new rows when there is overlap on the dates for different users? Kind of like the example below:

Dataframe 1:

Channel In_Hour Out_Hour
Channel_1 8:00 22:00
Channel_2 22:00 22:01
Channel_3 22:01 22:40

Dataframe 2:

Channel Program Start End
Channel_1 a 07:00 09:00
Channel_1 b 09:00 12:40
Channel_1 c 12:00 23:00
Channel_1 d 23:00 23:30
Channel_1 e 23:30 23:45
Channel_2 f 21:00 23:40
Channel_3 g 21:40 23:00

Objective Dataframe:

Channel Program Start End
Channel_1 a 08:00 09:00
Channel_1 b 09:00 12:00
Channel_1 c 12:00 22:00
Channel_2 f 22:00 22:01
Channel_3 g 22:01 22:40

Setup:

import pandas as pd

df1 = pd.DataFrame({
    'Channel': {0: 'Channel_1', 1: 'Channel_2', 2: 'Channel_3'},
    'In_Hour': {0: '8:00', 1: '22:00', 2: '22:01'},
    'Out_Hour': {0: '22:00', 1: '22:01', 2: '22:40'}
})

df1['In_Hour'] = pd.to_datetime(df1['In_Hour'])
df1['Out_Hour'] = pd.to_datetime(df1['Out_Hour'])

df2 = pd.DataFrame({
    'Channel': {0: 'Channel_1', 1: 'Channel_1', 2: 'Channel_1', 3: 'Channel_1',
                4: 'Channel_1', 5: 'Channel_2', 6: 'Channel_3'},
    'Program': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g'},
    'Start': {0: '07:00', 1: '09:00', 2: '12:00', 3: '23:00', 4: '23:30',
              5: '21:00', 6: '21:40'},
    'End': {0: '09:00', 1: '12:40', 2: '23:00', 3: '23:30', 4: '23:45',
            5: '23:40', 6: '23:00'}
})

df2['Start'] = pd.to_datetime(df2['Start'])
df2['End'] = pd.to_datetime(df2['End'])

Try merging the two frames on Channel, use a boolean mask to drop rows whose program does not overlap the In_Hour/Out_Hour window, then use apply + clip so that each remaining row's Start and End fall within In_Hour and Out_Hour.

# Merge Frames Together
df3 = df2.merge(df1, on='Channel')

# Start is before Out_Hour and End is after In_Hour
m1 = df3['Start'].lt(df3['Out_Hour']) & df3['End'].gt(df3['In_Hour'])

# Filter To Only Keep Rows that are within times
df3 = df3[m1].reset_index(drop=True)

df3 = df3[['Channel', 'Program']].join(
    # Row by row, compute the clipped Start and End columns
    df3.apply(
        # Clip lower and upper bounds based on In_Hour and Out_Hour
        lambda r: r[['Start', 'End']].clip(
            lower=r['In_Hour'], upper=r['Out_Hour']
        ),
        axis=1
    )
)

# Fix Hour Formatting
df3['Start'] = df3['Start'].dt.strftime('%H:%M')
df3['End'] = df3['End'].dt.strftime('%H:%M')

df3:

     Channel Program  Start    End
0  Channel_1       a  08:00  09:00
1  Channel_1       b  09:00  12:40
2  Channel_1       c  12:00  22:00
3  Channel_2       f  22:00  22:01
4  Channel_3       g  22:01  22:40
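
The row-wise apply above can be slow on large frames. As a rough sketch of a vectorized alternative (not part of the original answer), Series.clip also accepts a Series of per-row bounds. Because the mask already guarantees Start < Out_Hour and End > In_Hour, Start only needs a lower bound and End only an upper bound. The names windows, grid, merged and result are illustrative, and the explicit format='%H:%M' is an assumption made so that every timestamp gets the same placeholder date.

```python
import pandas as pd

# Viewing windows per channel (same values as df1 in the setup)
windows = pd.DataFrame({
    'Channel': ['Channel_1', 'Channel_2', 'Channel_3'],
    'In_Hour': ['8:00', '22:00', '22:01'],
    'Out_Hour': ['22:00', '22:01', '22:40'],
})

# Program grid (same values as df2 in the setup)
grid = pd.DataFrame({
    'Channel': ['Channel_1'] * 5 + ['Channel_2', 'Channel_3'],
    'Program': list('abcdefg'),
    'Start': ['07:00', '09:00', '12:00', '23:00', '23:30', '21:00', '21:40'],
    'End': ['09:00', '12:40', '23:00', '23:30', '23:45', '23:40', '23:00'],
})

# Assumption: parse with an explicit format so all values share one date
for col in ('In_Hour', 'Out_Hour'):
    windows[col] = pd.to_datetime(windows[col], format='%H:%M')
for col in ('Start', 'End'):
    grid[col] = pd.to_datetime(grid[col], format='%H:%M')

# Pair every program with the window of its channel
merged = grid.merge(windows, on='Channel')

# Keep only programs that overlap the window
overlaps = (merged['Start'].lt(merged['Out_Hour'])
            & merged['End'].gt(merged['In_Hour']))
merged = merged[overlaps].reset_index(drop=True)

# Clip whole columns at once using per-row bounds
result = merged[['Channel', 'Program']].copy()
result['Start'] = merged['Start'].clip(lower=merged['In_Hour'])
result['End'] = merged['End'].clip(upper=merged['Out_Hour'])

# Format back to HH:MM strings
result['Start'] = result['Start'].dt.strftime('%H:%M')
result['End'] = result['End'].dt.strftime('%H:%M')

print(result)
```

On this sample data the sketch should yield the same five rows as df3 above; whether it is noticeably faster depends on the size of the real data.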
