
How do I combine Pandas dataframes by looking at dates in one dataframe that fall within a date range in another dataframe?

I have two dataframes that hold employee data, as shown below. One data file lists the dates on which employees were sick, and the other lists the dates on which employees worked (i.e., presented as date ranges). I would like to combine the two files (hopefully in pandas) by finding where the "sick day" for a particular employee falls within a "work range". For example, in the data below, employee 1 was sick on 11/25/2015, 12/23/2015, and 10/12/2015. These fall in the "work ranges" 11/21/2015 - 11/29/2015, 12/21/2015 - 12/29/2015, and 10/9/2015 - 10/17/2015, respectively.

EMPLOYEE WORK DATES DATA:

╔══════════╦════════════╦════════════╗
║ Employee ║ datein     ║ dateout    ║
╠══════════╬════════════╬════════════╣
║ 1        ║ 11/21/2015 ║ 11/29/2015 ║
║ 2        ║ 12/9/2015  ║ 12/14/2015 ║
║ 3        ║ 11/10/2015 ║ 11/19/2015 ║
║ 4        ║ 11/11/2015 ║ 11/17/2015 ║
║ 5        ║ 11/30/2015 ║ 12/8/2015  ║
║ 1        ║ 12/21/2015 ║ 12/29/2015 ║
║ 2        ║ 1/7/2016   ║ 1/12/2016  ║
║ 3        ║ 12/10/2015 ║ 12/19/2015 ║
║ 4        ║ 12/10/2015 ║ 12/16/2015 ║
║ 5        ║ 12/30/2015 ║ 1/7/2016   ║
║ 1        ║ 10/9/2015  ║ 10/17/2015 ║
║ 2        ║ 10/27/2015 ║ 11/1/2015  ║
║ 3        ║ 9/28/2015  ║ 10/7/2015  ║
║ 4        ║ 9/29/2015  ║ 10/5/2015  ║
╚══════════╩════════════╩════════════╝

EMPLOYEE SICK DATES DATA:

╔══════════╦════════════╦═══════════╗
║ Employee ║ sickDate   ║ sickness  ║
╠══════════╬════════════╬═══════════╣
║ 1        ║ 11/25/2015 ║ flu       ║
║ 10       ║ 11/21/2015 ║ hd        ║
║ 21       ║ 9/20/2015  ║ other     ║
║ 1        ║ 12/23/2015 ║ other     ║
║ 4        ║ 12/13/2015 ║ vacationx ║
║ 7        ║ 7/21/2015  ║ cough     ║
║ 3        ║ 10/1/2015  ║ rash      ║
║ 4        ║ 10/5/2015  ║ other     ║
║ 5        ║ 1/7/2016   ║ eyex      ║
║ 2        ║ 12/12/2015 ║ tanx      ║
║ 1        ║ 10/12/2015 ║ fatiguex  ║
╚══════════╩════════════╩═══════════╝

CONSOLIDATED DATA:

╔══════════╦════════════╦════════════╦════════════╦═══════════╗
║ Employee ║ datein     ║ dateout    ║ sickDate   ║ sickness  ║
╠══════════╬════════════╬════════════╬════════════╬═══════════╣
║ 1        ║ 11/21/2015 ║ 11/29/2015 ║ 11/25/2015 ║ flu       ║
║ 2        ║ 12/9/2015  ║ 12/14/2015 ║ 12/12/2015 ║ tanx      ║
║ 3        ║ 11/10/2015 ║ 11/19/2015 ║            ║           ║
║ 4        ║ 11/11/2015 ║ 11/17/2015 ║            ║           ║
║ 5        ║ 11/30/2015 ║ 12/8/2015  ║            ║           ║
║ 1        ║ 12/21/2015 ║ 12/29/2015 ║ 12/23/2015 ║ other     ║
║ 2        ║ 1/7/2016   ║ 1/12/2016  ║            ║           ║
║ 3        ║ 12/10/2015 ║ 12/19/2015 ║            ║           ║
║ 4        ║ 12/10/2015 ║ 12/16/2015 ║ 12/13/2015 ║ vacationx ║
║ 5        ║ 12/30/2015 ║ 1/7/2016   ║ 1/7/2016   ║ eyex      ║
║ 1        ║ 10/9/2015  ║ 10/17/2015 ║ 10/12/2015 ║ fatiguex  ║
║ 2        ║ 10/27/2015 ║ 11/1/2015  ║            ║           ║
║ 3        ║ 9/28/2015  ║ 10/7/2015  ║ 10/1/2015  ║ rash      ║
║ 4        ║ 9/29/2015  ║ 10/5/2015  ║ 10/5/2015  ║ other     ║
╚══════════╩════════════╩════════════╩════════════╩═══════════╝


How do I do that in pandas or python? (Thank you for your help!)

Load the EMPLOYEE WORK DATES DATA table from the question into pd.DataFrame( ... ) as df1 and call set_index('Employee') on it.

Then load the EMPLOYEE SICK DATES DATA table into pd.DataFrame( ... ) as df2 and call set_index('Employee') on it as well.

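Finally, df = df1.join(df2).reset_index(). This joins on Employee only, so each work range is paired with every sick day recorded for that employee; the rows still have to be filtered to those where sickDate lies between datein and dateout.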
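
A minimal sketch of that join, using only a subset of the question's sample data (employees 1, 2 and 10). The between() filter at the end and the variable names in_range and matched are assumptions added for illustration; they are not part of the answer above.

```python
import pandas as pd

# Subset of the question's sample data (employees 1, 2 and 10 only).
work = pd.DataFrame({
    "Employee": [1, 2, 1, 2, 1, 2],
    "datein": ["11/21/2015", "12/9/2015", "12/21/2015",
               "1/7/2016", "10/9/2015", "10/27/2015"],
    "dateout": ["11/29/2015", "12/14/2015", "12/29/2015",
                "1/12/2016", "10/17/2015", "11/1/2015"],
})
sick = pd.DataFrame({
    "Employee": [1, 10, 1, 2, 1],
    "sickDate": ["11/25/2015", "11/21/2015", "12/23/2015",
                 "12/12/2015", "10/12/2015"],
    "sickness": ["flu", "hd", "other", "tanx", "fatiguex"],
})

# Parse the month/day/year strings before comparing dates.
for col in ("datein", "dateout"):
    work[col] = pd.to_datetime(work[col], format="%m/%d/%Y")
sick["sickDate"] = pd.to_datetime(sick["sickDate"], format="%m/%d/%Y")

df1 = work.set_index("Employee")
df2 = sick.set_index("Employee")

# Index-on-index left join: every work range of an employee is paired
# with every sick day of that same employee.
df = df1.join(df2).reset_index()

# Assumed extra step: keep pairs whose sick day lies inside the range
# (between() includes both endpoints by default).
in_range = df["sickDate"].between(df["datein"], df["dateout"])
matched = df[in_range]
print(len(df), len(matched))
```

On this subset the join produces more rows than there are work ranges, which is why the filter is needed. Work ranges without any sick day are dropped by the filter, so they would have to be merged back in to get the consolidated table from the question.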

Consider combining an inner and an outer pandas merge. The code below assumes the date columns are datetimes, so they are first converted from string objects:

import pandas as pd

# CONVERT DATE COLUMNS FROM STRINGS TO DATETIMES
workdf['datein'] = pd.to_datetime(workdf['datein'])
workdf['dateout'] = pd.to_datetime(workdf['dateout'])
sickdf['sickDate'] = pd.to_datetime(sickdf['sickDate'])

# INNER MERGE ON BOTH DFs WHERE SICK DAYS REPEAT FOR MATCHING EMPLOYEE ROW IN WORK DAYS
mergedf = pd.merge(workdf, sickdf, on='Employee', how="inner")

# KEEP ONLY SICK DAYS THAT FALL INSIDE THE WORK RANGE (BOTH ENDS INCLUSIVE)
inrange = mergedf[(mergedf['sickDate'] >= mergedf['datein']) &
                  (mergedf['sickDate'] <= mergedf['dateout'])]

# OUTER MERGE TO KEEP ALL WORK DAY RECORDS WITH FILTERED SICK DAYS DATA SET
finaldf = pd.merge(inrange, workdf, on=['Employee', 'datein', 'dateout'], how="outer")

# SORT BY EMPLOYEE AND WORK RANGE
finaldf = finaldf.sort_values(['Employee', 'datein', 'dateout']).reset_index(drop=True)

Result

#    Employee     datein      dateout     sickDate   sickness
#0          1 2015-10-09   2015-10-17   2015-10-12   fatiguex
#1          1 2015-11-21   2015-11-29   2015-11-25        flu
#2          1 2015-12-21   2015-12-29   2015-12-23      other
#3          2 2015-10-27   2015-11-01          NaT        NaN
#4          2 2015-12-09   2015-12-14   2015-12-12       tanx
#5          2 2016-01-07   2016-01-12          NaT        NaN
#6          3 2015-09-28   2015-10-07   2015-10-01       rash
#7          3 2015-11-10   2015-11-19          NaT        NaN
#8          3 2015-12-10   2015-12-19          NaT        NaN
#9          4 2015-09-29   2015-10-05   2015-10-05      other
#10         4 2015-11-11   2015-11-17          NaT        NaN
#11         4 2015-12-10   2015-12-16   2015-12-13  vacationx
#12         5 2015-11-30   2015-12-08          NaT        NaN
#13         5 2015-12-30   2016-01-07   2016-01-07       eyex  
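
The inner merge above pairs every work range with every sick day of the same employee before filtering, which can grow large on bigger data. As an alternative that neither answer uses, here is a rough sketch with pd.merge_asof. It assumes that an employee's work ranges do not overlap and that at most one sick day per range needs to be kept (only the latest one in each range survives). The names work, sick, out and valid are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
work = pd.DataFrame(
    [(1, "11/21/2015", "11/29/2015"), (2, "12/9/2015", "12/14/2015"),
     (3, "11/10/2015", "11/19/2015"), (4, "11/11/2015", "11/17/2015"),
     (5, "11/30/2015", "12/8/2015"), (1, "12/21/2015", "12/29/2015"),
     (2, "1/7/2016", "1/12/2016"), (3, "12/10/2015", "12/19/2015"),
     (4, "12/10/2015", "12/16/2015"), (5, "12/30/2015", "1/7/2016"),
     (1, "10/9/2015", "10/17/2015"), (2, "10/27/2015", "11/1/2015"),
     (3, "9/28/2015", "10/7/2015"), (4, "9/29/2015", "10/5/2015")],
    columns=["Employee", "datein", "dateout"],
)
sick = pd.DataFrame(
    [(1, "11/25/2015", "flu"), (10, "11/21/2015", "hd"),
     (21, "9/20/2015", "other"), (1, "12/23/2015", "other"),
     (4, "12/13/2015", "vacationx"), (7, "7/21/2015", "cough"),
     (3, "10/1/2015", "rash"), (4, "10/5/2015", "other"),
     (5, "1/7/2016", "eyex"), (2, "12/12/2015", "tanx"),
     (1, "10/12/2015", "fatiguex")],
    columns=["Employee", "sickDate", "sickness"],
)
for col in ("datein", "dateout"):
    work[col] = pd.to_datetime(work[col], format="%m/%d/%Y")
sick["sickDate"] = pd.to_datetime(sick["sickDate"], format="%m/%d/%Y")

# merge_asof needs both frames sorted by their merge keys.
# For each work range it picks the same employee's latest sick day
# that is on or before dateout.
out = pd.merge_asof(
    work.sort_values("dateout"),
    sick.sort_values("sickDate"),
    left_on="dateout",
    right_on="sickDate",
    by="Employee",
)

# Blank out sick days that precede the start of the range.
valid = out["sickDate"] >= out["datein"]
out["sickDate"] = out["sickDate"].where(valid)
out["sickness"] = out["sickness"].where(valid)

out = out.sort_values(["Employee", "datein"]).reset_index(drop=True)
print(out)
```

This keeps one row per work range, so no second merge is needed. If several sick days can fall inside a single range and all of them matter, the inner and outer merge approach is the safer choice.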
