Compare columns of two dataframes without merging the dataframes
I have two dataframes, DF1 and DF2.
DF1:
StartDate
1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015
DF2:
EmploymentType EmpStatus EmpStartDate
Employee Active 11/5/2012
Employee Active 9/10/2012
Employee Active 10/15/2013
Employee Active 10/29/2013
Employee Terminated 10/29/2013
Contractor Terminated 11/20/2014
Contractor Active 11/20/2014
For each StartDate in DF1, I want the count of rows from DF2 where EmploymentType = 'Employee', EmpStatus = 'Active', and EmpStartDate <= StartDate.
Expected output:
Start Date Count
1/1/2013 2
2/1/2013 2
11/1/2014 4
4/1/2014 4
5/1/2015 4
How do I achieve this without merging the two dataframes?
I cannot merge the dataframes because they share no common keys. Because I need a conditional row count, I also cannot join them on a temporary column: that would be a cross join, which I need to avoid.
You can do it using a cartesian join and filtering, provided your dataframes are not too big:
(df1.assign(key=1)
.merge(df2.query('EmploymentType == "Employee" and EmpStatus=="Active"').assign(key=1),
on='key')
.query('EmpStartDate <= StartDate')
.groupby('StartDate')['key'].count())
Output:
StartDate
2013-01-01 2
2013-02-01 2
2014-04-01 4
2014-11-01 4
2015-05-01 4
Name: key, dtype: int64
query filters df2 down to the rows where EmploymentType and EmpStatus equal Employee and Active, respectively.
merge on a dummy key creates a cartesian join of all records.
query then filters the join results to only those records where EmpStartDate is less than or equal to StartDate.
groupby StartDate and count.
Also, note that using query is a shortcut. If your column names contain special characters or spaces, you'll need to either quote them in backticks inside query (supported since pandas 0.25) or filter your dataframes using boolean indexing.
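The steps above can be sketched with boolean indexing in place of query. This is a minimal, self-contained sketch: the sample data is copied from the question, the dates are assumed to be parsed as datetimes, and the intermediate names (active, crossed, counts) are only illustrative.

```python
import pandas as pd

# Sample data copied from the question; dates are parsed so that
# comparisons are chronological rather than string-based.
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# Boolean indexing instead of query(): works for any column name.
active = df2[(df2["EmploymentType"] == "Employee")
             & (df2["EmpStatus"] == "Active")]

# The dummy key pairs every df1 row with every filtered df2 row.
crossed = df1.assign(key=1).merge(active.assign(key=1), on="key")

# Keep pairs where the employee started on or before StartDate,
# then count the surviving pairs per StartDate.
counts = (crossed[crossed["EmpStartDate"] <= crossed["StartDate"]]
          .groupby("StartDate")["key"].count())
print(counts)
```

A StartDate with no matching employees drops out of the result entirely rather than appearing with a count of 0. On pandas 1.2 or later, merge(how="cross") builds the same cartesian product without the dummy key.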
pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').sort_values('EmpStartDate'),
df1.sort_values('StartDate'),
left_on='EmpStartDate',
right_on='StartDate',
direction='forward')\
.groupby('StartDate')['EmploymentType'].count()\
.reindex(df1.StartDate.sort_values())\
.cumsum()\
.ffill()
Output:
StartDate
2013-01-01 2.0
2013-02-01 2.0
2014-04-01 4.0
2014-11-01 4.0
2015-05-01 4.0
Name: EmploymentType, dtype: float64
Details:
Use pd.merge_asof to join the filtered df2 to df1 on the nearest forward-looking date.
groupby the start date joined on from df1 and count.
reindex the results by df1.StartDate so that start dates with no matches appear as missing values.
cumsum to mimic the <= functionality with a running total.
ffill to populate the missing records with the previous sums.
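The merge_asof steps can also be followed one intermediate result at a time. The sketch below is a variant, not the exact chain from the answer: it assumes the question's sample data with parsed dates and unique StartDate values, and it fills the gaps with 0 before the cumulative sum so that the result stays an integer and the final ffill is not needed. The names active, starts, matched and per_bucket are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# merge_asof requires both inputs to be sorted on their join keys.
active = (df2[(df2["EmploymentType"] == "Employee")
              & (df2["EmpStatus"] == "Active")]
          .sort_values("EmpStartDate"))
starts = df1.sort_values("StartDate")

# Each active employee is matched to the first StartDate that falls
# on or after their EmpStartDate.
matched = pd.merge_asof(active, starts,
                        left_on="EmpStartDate", right_on="StartDate",
                        direction="forward")

# New employees per StartDate; dates that gained nobody become 0.
per_bucket = (matched.groupby("StartDate")["EmploymentType"].count()
              .reindex(starts["StartDate"])
              .fillna(0))

# The running total is the number of employees with
# EmpStartDate <= StartDate.
result = per_bucket.cumsum().astype(int)
print(result)
```

Filling with 0 before cumsum also gives a count of 0, rather than NaN, for any StartDate that precedes every employee start date.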
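def compensation(x):
    # Count active employees whose start date is on or before x.
    # Each condition needs its own parentheses, combined with &
    # (the Python keyword `and` does not work on a Series).
    mask = ((DF2['EmpStartDate'] <= x)
            & (DF2['EmploymentType'] == 'Employee')
            & (DF2['EmpStatus'] == 'Active'))
    return mask.sum()

# Series.apply calls the function once per value; it takes no axis argument.
DF1['Count'] = DF1['StartDate'].apply(compensation)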
The method is boolean indexing and counting rows. See https://pandas.pydata.org/pandas-docs/stable/indexing.html
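If calling a function once per StartDate becomes slow, the same count can be computed in one vectorized step. The sketch below swaps in a different technique from the answers above, NumPy's searchsorted on the sorted employee start dates. It assumes the question's sample data with parsed dates, and the names mask and active_dates are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# Start dates of active employees, sorted ascending.
mask = (df2["EmploymentType"] == "Employee") & (df2["EmpStatus"] == "Active")
active_dates = np.sort(df2.loc[mask, "EmpStartDate"].to_numpy())

# With side="right", searchsorted returns for each StartDate the number
# of sorted dates that are <= it, which is exactly the wanted count.
df1["Count"] = np.searchsorted(active_dates,
                               df1["StartDate"].to_numpy(),
                               side="right")
print(df1)
```

This keeps DF1 in its original row order and never pairs up rows from the two dataframes, so no merge or cross join is involved.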