How do I combine Pandas dataframes by looking at dates in one dataframe that fall within a date range in another dataframe?
I have two dataframes that hold employee data, as shown below. One has the dates on which employees were sick, and the other has the dates on which employees worked (i.e., presented as date ranges). I would like to combine the two (hopefully in pandas) by looking at where the "sick day" for a particular employee falls within a "work range". For example, in the data below, employee 1 was sick on 11/25/2015, 12/23/2015, and 10/12/2015. These fall in the "work ranges" 11/21/2015 - 11/29/2015, 12/21/2015 - 12/29/2015, and 10/9/2015 - 10/17/2015, respectively.
EMPLOYEE WORK DATES DATA:
╔══════════╦════════════╦════════════╗
║ Employee ║ datein     ║ dateout    ║
╠══════════╬════════════╬════════════╣
║ 1        ║ 11/21/2015 ║ 11/29/2015 ║
║ 2        ║ 12/9/2015  ║ 12/14/2015 ║
║ 3        ║ 11/10/2015 ║ 11/19/2015 ║
║ 4        ║ 11/11/2015 ║ 11/17/2015 ║
║ 5        ║ 11/30/2015 ║ 12/8/2015  ║
║ 1        ║ 12/21/2015 ║ 12/29/2015 ║
║ 2        ║ 1/7/2016   ║ 1/12/2016  ║
║ 3        ║ 12/10/2015 ║ 12/19/2015 ║
║ 4        ║ 12/10/2015 ║ 12/16/2015 ║
║ 5        ║ 12/30/2015 ║ 1/7/2016   ║
║ 1        ║ 10/9/2015  ║ 10/17/2015 ║
║ 2        ║ 10/27/2015 ║ 11/1/2015  ║
║ 3        ║ 9/28/2015  ║ 10/7/2015  ║
║ 4        ║ 9/29/2015  ║ 10/5/2015  ║
╚══════════╩════════════╩════════════╝
EMPLOYEE SICK DATES DATA:
╔══════════╦════════════╦═══════════╗
║ Employee ║ sickDate   ║ sickness  ║
╠══════════╬════════════╬═══════════╣
║ 1        ║ 11/25/2015 ║ flu       ║
║ 10       ║ 11/21/2015 ║ hd        ║
║ 21       ║ 9/20/2015  ║ other     ║
║ 1        ║ 12/23/2015 ║ other     ║
║ 4        ║ 12/13/2015 ║ vacationx ║
║ 7        ║ 7/21/2015  ║ cough     ║
║ 3        ║ 10/1/2015  ║ rash      ║
║ 4        ║ 10/5/2015  ║ other     ║
║ 5        ║ 1/7/2016   ║ eyex      ║
║ 2        ║ 12/12/2015 ║ tanx      ║
║ 1        ║ 10/12/2015 ║ fatiguex  ║
╚══════════╩════════════╩═══════════╝
CONSOLIDATED DATA:
╔══════════╦════════════╦════════════╦════════════╦═══════════╗
║ Employee ║ datein     ║ dateout    ║ sickDate   ║ sickness  ║
╠══════════╬════════════╬════════════╬════════════╬═══════════╣
║ 1        ║ 11/21/2015 ║ 11/29/2015 ║ 11/25/2015 ║ flu       ║
║ 2        ║ 12/9/2015  ║ 12/14/2015 ║ 12/12/2015 ║ tanx      ║
║ 3        ║ 11/10/2015 ║ 11/19/2015 ║            ║           ║
║ 4        ║ 11/11/2015 ║ 11/17/2015 ║            ║           ║
║ 5        ║ 11/30/2015 ║ 12/8/2015  ║            ║           ║
║ 1        ║ 12/21/2015 ║ 12/29/2015 ║ 12/23/2015 ║ other     ║
║ 2        ║ 1/7/2016   ║ 1/12/2016  ║            ║           ║
║ 3        ║ 12/10/2015 ║ 12/19/2015 ║            ║           ║
║ 4        ║ 12/10/2015 ║ 12/16/2015 ║ 12/13/2015 ║ vacationx ║
║ 5        ║ 12/30/2015 ║ 1/7/2016   ║ 1/7/2016   ║ eyex      ║
║ 1        ║ 10/9/2015  ║ 10/17/2015 ║ 10/12/2015 ║ fatiguex  ║
║ 2        ║ 10/27/2015 ║ 11/1/2015  ║            ║           ║
║ 3        ║ 9/28/2015  ║ 10/7/2015  ║ 10/1/2015  ║ rash      ║
║ 4        ║ 9/29/2015  ║ 10/5/2015  ║ 10/5/2015  ║ other     ║
╚══════════╩════════════╩════════════╩════════════╩═══════════╝
How do I do that in pandas or Python? (Thank you for your help!)
You need to put this data into pd.DataFrame( ... ) as df1 and call set_index('Employee'):
╔══════════╦════════════╦════════════╗
║ Employee ║ datein     ║ dateout    ║
╠══════════╬════════════╬════════════╣
║ 1        ║ 11/21/2015 ║ 11/29/2015 ║
║ 2        ║ 12/9/2015  ║ 12/14/2015 ║
║ 3        ║ 11/10/2015 ║ 11/19/2015 ║
║ 4        ║ 11/11/2015 ║ 11/17/2015 ║
║ 5        ║ 11/30/2015 ║ 12/8/2015  ║
║ 1        ║ 12/21/2015 ║ 12/29/2015 ║
║ 2        ║ 1/7/2016   ║ 1/12/2016  ║
║ 3        ║ 12/10/2015 ║ 12/19/2015 ║
║ 4        ║ 12/10/2015 ║ 12/16/2015 ║
║ 5        ║ 12/30/2015 ║ 1/7/2016   ║
║ 1        ║ 10/9/2015  ║ 10/17/2015 ║
║ 2        ║ 10/27/2015 ║ 11/1/2015  ║
║ 3        ║ 9/28/2015  ║ 10/7/2015  ║
║ 4        ║ 9/29/2015  ║ 10/5/2015  ║
╚══════════╩════════════╩════════════╝
Then put this data into pd.DataFrame( ... ) as df2 and call set_index('Employee'):
╔══════════╦════════════╦═══════════╗
║ Employee ║ sickDate   ║ sickness  ║
╠══════════╬════════════╬═══════════╣
║ 1        ║ 11/25/2015 ║ flu       ║
║ 10       ║ 11/21/2015 ║ hd        ║
║ 21       ║ 9/20/2015  ║ other     ║
║ 1        ║ 12/23/2015 ║ other     ║
║ 4        ║ 12/13/2015 ║ vacationx ║
║ 7        ║ 7/21/2015  ║ cough     ║
║ 3        ║ 10/1/2015  ║ rash      ║
║ 4        ║ 10/5/2015  ║ other     ║
║ 5        ║ 1/7/2016   ║ eyex      ║
║ 2        ║ 12/12/2015 ║ tanx      ║
║ 1        ║ 10/12/2015 ║ fatiguex  ║
╚══════════╩════════════╩═══════════╝
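Finally, df = df1.join(df2).reset_index(). Note that join matches rows on the Employee index only, so the result still has to be filtered down to the rows where sickDate lies between datein and dateout.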
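As a rough illustration of the join described above, here is a minimal sketch on a subset of the sample data (employees 1 and 2, plus employee 10 from the sick list). The date parsing with an explicit month/day/year format and the final between() filter are assumptions added for this sketch; they are not part of the answer as posted.

```python
import pandas as pd

# Subset of the sample data, to keep the sketch short
df1 = pd.DataFrame({
    "Employee": [1, 2, 1, 2, 1, 2],
    "datein":  ["11/21/2015", "12/9/2015", "12/21/2015", "1/7/2016", "10/9/2015", "10/27/2015"],
    "dateout": ["11/29/2015", "12/14/2015", "12/29/2015", "1/12/2016", "10/17/2015", "11/1/2015"],
})
df2 = pd.DataFrame({
    "Employee": [1, 10, 1, 2, 1],
    "sickDate": ["11/25/2015", "11/21/2015", "12/23/2015", "12/12/2015", "10/12/2015"],
    "sickness": ["flu", "hd", "other", "tanx", "fatiguex"],
})

# Parse the month/day/year strings so the comparison below is chronological
for col in ("datein", "dateout"):
    df1[col] = pd.to_datetime(df1[col], format="%m/%d/%Y")
df2["sickDate"] = pd.to_datetime(df2["sickDate"], format="%m/%d/%Y")

df1 = df1.set_index("Employee")
df2 = df2.set_index("Employee")

# join() aligns on the Employee index only: every work range of an employee
# is paired with every sick day of that same employee
joined = df1.join(df2).reset_index()

# Assumed extra step: keep only pairs where the sick day lies inside the
# work range (both endpoints inclusive)
in_range = joined[joined["sickDate"].between(joined["datein"], joined["dateout"])]

print(len(joined), len(in_range))
```

The filtered frame contains only work ranges that had a sick day in them. Work ranges without a matching sick day drop out at this point, so they would need to be merged back in to get the consolidated table from the question.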
Consider an inner and outer pandas merge approach. The code below assumes the dates are in datetime format, which may require conversion from string objects:
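import pandas as pd

# Convert the date columns from strings to datetime
workdf['datein'] = pd.to_datetime(workdf['datein'])
workdf['dateout'] = pd.to_datetime(workdf['dateout'])
sickdf['sickDate'] = pd.to_datetime(sickdf['sickDate'])
# INNER MERGE ON BOTH DFs WHERE SICK DAYS REPEAT FOR MATCHING EMPLOYEE ROW IN WORK DAYS
mergedf = pd.merge(workdf, sickdf, on='Employee', how="inner")
# KEEP ONLY ROWS WHERE THE SICK DAY FALLS WITHIN THE WORK RANGE (INCLUSIVE);
# compare the dates directly, since a Timedelta cannot be compared with the integer 0
inrangedf = mergedf[(mergedf['sickDate'] >= mergedf['datein']) &
                    (mergedf['sickDate'] <= mergedf['dateout'])]
# OUTER MERGE TO KEEP ALL WORK DAY RECORDS WITH FILTERED SICK DAYS DATA SET
finaldf = pd.merge(inrangedf, workdf, on=['Employee', 'datein', 'dateout'], how="outer")
# DataFrame.sort() was removed from pandas; sort_values() is its replacement
finaldf = finaldf.sort_values(['Employee', 'datein', 'dateout']).reset_index(drop=True)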
Result
#    Employee     datein    dateout   sickDate   sickness
#0          1 2015-10-09 2015-10-17 2015-10-12   fatiguex
#1          1 2015-11-21 2015-11-29 2015-11-25        flu
#2          1 2015-12-21 2015-12-29 2015-12-23      other
#3          2 2015-10-27 2015-11-01        NaT        NaN
#4          2 2015-12-09 2015-12-14 2015-12-12       tanx
#5          2 2016-01-07 2016-01-12        NaT        NaN
#6          3 2015-09-28 2015-10-07 2015-10-01       rash
#7          3 2015-11-10 2015-11-19        NaT        NaN
#8          3 2015-12-10 2015-12-19        NaT        NaN
#9          4 2015-09-29 2015-10-05 2015-10-05      other
#10         4 2015-11-11 2015-11-17        NaT        NaN
#11         4 2015-12-10 2015-12-16 2015-12-13  vacationx
#12         5 2015-11-30 2015-12-08        NaT        NaN
#13         5 2015-12-30 2016-01-07 2016-01-07       eyex
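To try the merge approach end to end, the sketch below builds both frames from the sample data in the question and wraps the steps in a helper. The name attach_sick_days is hypothetical, and the sketch swaps two details of the answer: it filters with Series.between, and it uses a left merge from the work frame in place of the outer merge. Both swaps should give the same rows for this data.

```python
import pandas as pd

def attach_sick_days(workdf, sickdf):
    """Hypothetical helper: one row per work range, with the sick day
    attached when it falls inside that range (inclusive)."""
    # Pair every work range with every sick day of the same employee
    pairs = pd.merge(workdf, sickdf, on="Employee", how="inner")
    # Keep pairs where datein <= sickDate <= dateout
    in_range = pairs[pairs["sickDate"].between(pairs["datein"], pairs["dateout"])]
    # Left merge from workdf so ranges without a sick day are kept (NaT/NaN)
    out = pd.merge(workdf, in_range, on=["Employee", "datein", "dateout"], how="left")
    return out.sort_values(["Employee", "datein", "dateout"]).reset_index(drop=True)

# Sample data copied from the question
workdf = pd.DataFrame({
    "Employee": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4],
    "datein": ["11/21/2015", "12/9/2015", "11/10/2015", "11/11/2015", "11/30/2015",
               "12/21/2015", "1/7/2016", "12/10/2015", "12/10/2015", "12/30/2015",
               "10/9/2015", "10/27/2015", "9/28/2015", "9/29/2015"],
    "dateout": ["11/29/2015", "12/14/2015", "11/19/2015", "11/17/2015", "12/8/2015",
                "12/29/2015", "1/12/2016", "12/19/2015", "12/16/2015", "1/7/2016",
                "10/17/2015", "11/1/2015", "10/7/2015", "10/5/2015"],
})
sickdf = pd.DataFrame({
    "Employee": [1, 10, 21, 1, 4, 7, 3, 4, 5, 2, 1],
    "sickDate": ["11/25/2015", "11/21/2015", "9/20/2015", "12/23/2015", "12/13/2015",
                 "7/21/2015", "10/1/2015", "10/5/2015", "1/7/2016", "12/12/2015",
                 "10/12/2015"],
    "sickness": ["flu", "hd", "other", "other", "vacationx", "cough", "rash",
                 "other", "eyex", "tanx", "fatiguex"],
})

# The sample dates are month/day/year strings; parse them explicitly
for col in ("datein", "dateout"):
    workdf[col] = pd.to_datetime(workdf[col], format="%m/%d/%Y")
sickdf["sickDate"] = pd.to_datetime(sickdf["sickDate"], format="%m/%d/%Y")

result = attach_sick_days(workdf, sickdf)
print(result)
```

One caveat with any variant of this approach: if two sick days of the same employee fall inside a single work range, that range appears once per sick day in the output.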