I have two dataframes DF1 and DF2.
DF1:
StartDate
1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015
DF2:
EmploymentType EmpStatus EmpStartDate
Employee Active 11/5/2012
Employee Active 9/10/2012
Employee Active 10/15/2013
Employee Active 10/29/2013
Employee Terminated 10/29/2013
Contractor Terminated 11/20/2014
Contractor Active 11/20/2014
For each StartDate in DF1, I want the count of rows from DF2 where EmploymentType = 'Employee' and EmpStatus = 'Active' and EmpStartDate <= StartDate.
Output:
Start Date Count
1/1/2013 2
2/1/2013 2
11/1/2014 4
4/1/2014 4
5/1/2015 4
How do I achieve this without merging the two dataframes?
I cannot merge the dataframes since there are no common keys. Because I need the count of rows based on conditions, I also cannot join them on a temporary column, as I need to avoid a cross join.
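For anyone who wants to reproduce this, here is a minimal sketch of the sample data. It uses the lowercase names df1 and df2 that the answers use, and it assumes the dates are month/day/year strings that need parsing before <= compares them as dates rather than text.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "StartDate": ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
                     "10/29/2013", "11/20/2014", "11/20/2014"],
})

# Assumption: month/day/year. Parse to datetimes so comparisons are chronological.
df1["StartDate"] = pd.to_datetime(df1["StartDate"], format="%m/%d/%Y")
df2["EmpStartDate"] = pd.to_datetime(df2["EmpStartDate"], format="%m/%d/%Y")
```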
You can do it using a cartesian join and filtering, as long as your dataframes aren't too big:
(df1.assign(key=1)
    .merge(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').assign(key=1),
           on='key')
    .query('EmpStartDate <= StartDate')
    .groupby('StartDate')['key'].count())
Output:
StartDate
2013-01-01 2
2013-02-01 2
2014-04-01 4
2014-11-01 4
2015-05-01 4
Name: key, dtype: int64
Details:
- query to include only rows where EmploymentType and EmpStatus equal Employee and Active respectively.
- merge on a dummy key to create a cartesian join of all records.
- query to filter the results of the join to only those records where EmpStartDate is less than or equal to StartDate.
- groupby StartDate and count.
Also, note that using query is a shortcut. If your column names contain special characters or spaces, then you'll need to filter your dataframes using boolean indexing.
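As a rough sketch of that boolean indexing variant, the same cartesian join can be written without query at all. The sample frames are rebuilt from the question, and the month/day/year date format is an assumption.

```python
import pandas as pd

# Sample data from the question (dates assumed to be month/day/year)
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"], format="%m/%d/%Y"),
})

# Boolean indexing instead of query(); bracket access works for any column name
active = df2[(df2["EmploymentType"] == "Employee") & (df2["EmpStatus"] == "Active")]

# Dummy key gives the cartesian join, then keep pairs where EmpStartDate <= StartDate
joined = df1.assign(key=1).merge(active.assign(key=1), on="key")
joined = joined[joined["EmpStartDate"] <= joined["StartDate"]]

counts = joined.groupby("StartDate")["key"].count()
print(counts)
```

One caveat with this approach: a StartDate with no matching rows is filtered out before the groupby, so it will be missing from the result instead of showing a count of 0.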
pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').sort_values('EmpStartDate'),
              df1.sort_values('StartDate'),
              left_on='EmpStartDate',
              right_on='StartDate',
              direction='forward')\
  .groupby('StartDate')['EmploymentType'].count()\
  .reindex(df1.StartDate.sort_values())\
  .cumsum()\
  .ffill()
Output:
StartDate
2013-01-01 2.0
2013-02-01 2.0
2014-04-01 4.0
2014-11-01 4.0
2015-05-01 4.0
Name: EmploymentType, dtype: float64
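Details:
- Use pd.merge_asof to join the filtered df2 to df1 on the nearest forward-looking date.
- groupby the start date joined on from df1 and count.
- reindex the results by df1.StartDate to fill in missing values for start dates that had no match.
- cumsum to mimic the <= functionality by summing the counts.
- ffill to populate missing records with the previous sums.

def compensation(x):
    # Count active employees whose start date is on or before x.
    # Conditions are combined with & (not `and`), each wrapped in parentheses.
    mask = ((DF2['EmploymentType'] == 'Employee')
            & (DF2['EmpStatus'] == 'Active')
            & (DF2['EmpStartDate'] <= x))
    return DF2[mask].shape[0]

# Series.apply calls the function once per value; it takes no axis argument
DF1['Count'] = DF1['StartDate'].apply(compensation)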
The method is Boolean indexing and counting rows. https://pandas.pydata.org/pandas-docs/stable/indexing.html
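If calling a function once per row of DF1 turns out to be slow on large frames, one alternative is to sort the filtered dates once and look up each StartDate with numpy.searchsorted. This is not from any of the answers above, only a sketch under the assumption that both date columns are already datetimes (month/day/year in the sample data).

```python
import numpy as np
import pandas as pd

# Sample data from the question (dates assumed to be month/day/year)
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"], format="%m/%d/%Y"),
})

# Keep only active employees and sort their start dates once
mask = (df2["EmploymentType"] == "Employee") & (df2["EmpStatus"] == "Active")
emp_dates = np.sort(df2.loc[mask, "EmpStartDate"].to_numpy())

# side="right" gives, for each StartDate, the number of sorted dates <= it
df1["Count"] = np.searchsorted(emp_dates, df1["StartDate"].to_numpy(), side="right")
print(df1)
```

This keeps the original row order of df1, needs no join, and returns 0 for a StartDate that precedes every employee start date.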