I have two dataframes that have employee data as below. One data file has employee data including dates on which employees were sick, and the other data file has dates on which employees worked (ie presented as date ranges). I would like to combine the two files (hopefully in pandas) by looking at where the "sick day" for a particular employee falls in a "work range". For example, in image/data below, employee 1 was sick on 11/25/2015, 12/23/2015, and 10/12/2015. These fall in the "work ranges" 11/21/2015 - 11/29/2015, 12/21/2015 - 12/29/2015, and 10/9/2015 - 10/17/2015, respectively.
EMPLOYEE WORK DATES DATA:
╔══════════╦════════════╦════════════╗ ║ Employee ║ datein ║ dateout ║ ╠══════════╬════════════╬════════════╣ ║ 1 ║ 11/21/2015 ║ 11/29/2015 ║ ║ 2 ║ 12/9/2015 ║ 12/14/2015 ║ ║ 3 ║ 11/10/2015 ║ 11/19/2015 ║ ║ 4 ║ 11/11/2015 ║ 11/17/2015 ║ ║ 5 ║ 11/30/2015 ║ 12/8/2015 ║ ║ 1 ║ 12/21/2015 ║ 12/29/2015 ║ ║ 2 ║ 1/7/2016 ║ 1/12/2016 ║ ║ 3 ║ 12/10/2015 ║ 12/19/2015 ║ ║ 4 ║ 12/10/2015 ║ 12/16/2015 ║ ║ 5 ║ 12/30/2015 ║ 1/7/2016 ║ ║ 1 ║ 10/9/2015 ║ 10/17/2015 ║ ║ 2 ║ 10/27/2015 ║ 11/1/2015 ║ ║ 3 ║ 9/28/2015 ║ 10/7/2015 ║ ║ 4 ║ 9/29/2015 ║ 10/5/2015 ║ ╚══════════╩════════════╩════════════╝
EMPLOYEE SICK DATES DATA:
╔══════════╦════════════╦═══════════╗ ║ Employee ║ sickDate ║ sickness ║ ╠══════════╬════════════╬═══════════╣ ║ 1 ║ 11/25/2015 ║ flu ║ ║ 10 ║ 11/21/2015 ║ hd ║ ║ 21 ║ 9/20/2015 ║ other ║ ║ 1 ║ 12/23/2015 ║ other ║ ║ 4 ║ 12/13/2015 ║ vacationx ║ ║ 7 ║ 7/21/2015 ║ cough ║ ║ 3 ║ 10/1/2015 ║ rash ║ ║ 4 ║ 10/5/2015 ║ other ║ ║ 5 ║ 1/7/2016 ║ eyex ║ ║ 2 ║ 12/12/2015 ║ tanx ║ ║ 1 ║ 10/12/2015 ║ fatiguex ║ ╚══════════╩════════════╩═══════════╝
CONSOLIDATED DATA:
╔══════════╦════════════╦════════════╦════════════╦═══════════╗ ║ Employee ║ datein ║ dateout ║ sickDate ║ sickness ║ ╠══════════╬════════════╬════════════╬════════════╬═══════════╣ ║ 1 ║ 11/21/2015 ║ 11/29/2015 ║ 11/25/2015 ║ flu ║ ║ 2 ║ 12/9/2015 ║ 12/14/2015 ║ 12/12/2015 ║ tanx ║ ║ 3 ║ 11/10/2015 ║ 11/19/2015 ║ ║ ║ ║ 4 ║ 11/11/2015 ║ 11/17/2015 ║ ║ ║ ║ 5 ║ 11/30/2015 ║ 12/8/2015 ║ ║ ║ ║ 1 ║ 12/21/2015 ║ 12/29/2015 ║ 12/23/2015 ║ other ║ ║ 2 ║ 1/7/2016 ║ 1/12/2016 ║ ║ ║ ║ 3 ║ 12/10/2015 ║ 12/19/2015 ║ ║ ║ ║ 4 ║ 12/10/2015 ║ 12/16/2015 ║ 12/13/2015 ║ vacationx ║ ║ 5 ║ 12/30/2015 ║ 1/7/2016 ║ 1/7/2016 ║ eyex ║ ║ 1 ║ 10/9/2015 ║ 10/17/2015 ║ 10/12/2015 ║ fatiguex ║ ║ 2 ║ 10/27/2015 ║ 11/1/2015 ║ ║ ║ ║ 3 ║ 9/28/2015 ║ 10/7/2015 ║ 10/1/2015 ║ rash ║ ║ 4 ║ 9/29/2015 ║ 10/5/2015 ║ 10/5/2015 ║ other ║ ╚══════════╩════════════╩════════════╩════════════╩═══════════╝
How do I do that in pandas or python? (Thank you for your help!)
You need to put this data to pd.DataFrame( ... )
as df1 and set_index('Employee')
╔══════════╦════════════╦════════════╗ ║ Employee ║ datein ║ dateout ║ ╠══════════╬════════════╬════════════╣ ║ 1 ║ 11/21/2015 ║ 11/29/2015 ║ ║ 2 ║ 12/9/2015 ║ 12/14/2015 ║ ║ 3 ║ 11/10/2015 ║ 11/19/2015 ║ ║ 4 ║ 11/11/2015 ║ 11/17/2015 ║ ║ 5 ║ 11/30/2015 ║ 12/8/2015 ║ ║ 1 ║ 12/21/2015 ║ 12/29/2015 ║ ║ 2 ║ 1/7/2016 ║ 1/12/2016 ║ ║ 3 ║ 12/10/2015 ║ 12/19/2015 ║ ║ 4 ║ 12/10/2015 ║ 12/16/2015 ║ ║ 5 ║ 12/30/2015 ║ 1/7/2016 ║ ║ 1 ║ 10/9/2015 ║ 10/17/2015 ║ ║ 2 ║ 10/27/2015 ║ 11/1/2015 ║ ║ 3 ║ 9/28/2015 ║ 10/7/2015 ║ ║ 4 ║ 9/29/2015 ║ 10/5/2015 ║ ╚══════════╩════════════╩════════════╝
Then put this data to pd.DataFrame( ... )
as df2 and set_index('Employee')
╔══════════╦════════════╦═══════════╗ ║ Employee ║ sickDate ║ sickness ║ ╠══════════╬════════════╬═══════════╣ ║ 1 ║ 11/25/2015 ║ flu ║ ║ 10 ║ 11/21/2015 ║ hd ║ ║ 21 ║ 9/20/2015 ║ other ║ ║ 1 ║ 12/23/2015 ║ other ║ ║ 4 ║ 12/13/2015 ║ vacationx ║ ║ 7 ║ 7/21/2015 ║ cough ║ ║ 3 ║ 10/1/2015 ║ rash ║ ║ 4 ║ 10/5/2015 ║ other ║ ║ 5 ║ 1/7/2016 ║ eyex ║ ║ 2 ║ 12/12/2015 ║ tanx ║ ║ 1 ║ 10/12/2015 ║ fatiguex ║ ╚══════════╩════════════╩═══════════╝
Finally, df = df1.join(df2).reset_index()
Consider an inner and outer pandas merge approach. Below assumes dates are in datetime
formats which may require conversion from string objects:
workdf['datein'] = pd.to_datetime(workdf['datein'])
workdf['dateout'] = pd.to_datetime(workdf['dateout'])
sickdf['sickDate'] = pd.to_datetime(sickdf['sickDate'])
# INNER MERGE ON BOTH DFs WHERE SICK DAYS REPEAT FOR MATCHING EMPLOYEE ROW IN WORK DAYS
mergedf = pd.merge(workdf, sickdf, on='Employee', how="inner")
# OUTER MERGE TO KEEP ALL WORK DAY RECORDS WITH FILTERED SICK DAYS DATA SET
finaldf = pd.merge(mergedf[(mergedf['sickDate'] - mergedf['datein'] >= 0) &
(mergedf['dateout'] - mergedf['sickDate'] >= 0)],
workdf, on=['Employee', 'datein', 'dateout'], how="outer")
finaldf = finaldf.sort(['Employee','datein','dateout']).reset_index(drop=True)
Result
# Employee datein dateout sickDate sickness
#0 1 2015-10-09 2015-10-17 2015-10-12 fatiguex
#1 1 2015-11-21 2015-11-29 2015-11-25 flu
#2 1 2015-12-21 2015-12-29 2015-12-23 other
#3 2 2015-10-27 2015-11-01 NaT NaN
#4 2 2015-12-09 2015-12-14 2015-12-12 tanx
#5 2 2016-01-07 2016-01-12 NaT NaN
#6 3 2015-09-28 2015-10-07 2015-10-01 rash
#7 3 2015-11-10 2015-11-19 NaT NaN
#8 3 2015-12-10 2015-12-19 NaT NaN
#9 4 2015-09-29 2015-10-05 2015-10-05 other
#10 4 2015-11-11 2015-11-17 NaT NaN
#11 4 2015-12-10 2015-12-16 2015-12-13 vacationx
#12 5 2015-11-30 2015-12-08 NaT NaN
#13 5 2015-12-30 2016-01-07 2016-01-07 eyex
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.