
Compare columns of two dataframes without merging the dataframes

I have two dataframes, DF1 and DF2.

DF1:

StartDate

1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015

DF2:

EmploymentType        EmpStatus           EmpStartDate

Employee              Active              11/5/2012
Employee              Active              9/10/2012
Employee              Active              10/15/2013
Employee              Active              10/29/2013
Employee              Terminated          10/29/2013
Contractor            Terminated          11/20/2014
Contractor            Active              11/20/2014

For each StartDate in DF1, I want the count of rows from DF2 where EmploymentType = 'Employee' and EmpStatus = 'Active' and EmpStartDate <= StartDate.

Output:

Start Date    Count

1/1/2013      2
2/1/2013      2
11/1/2014     4
4/1/2014      4
5/1/2015      4

How do I achieve this without merging the two dataframes?

I cannot merge the dataframes because they have no common keys. Because I need the count of rows based on conditions, I also cannot join them on a temporary column, as I need to avoid a cross join.

You can do it using a cartesian join and filtering, provided your dataframes are not too big:

(df1.assign(key=1)
   .merge(df2.query('EmploymentType == "Employee" and EmpStatus=="Active"').assign(key=1), 
          on='key')
   .query('EmpStartDate <= StartDate')
   .groupby('StartDate')['key'].count())

Output:

StartDate
2013-01-01    2
2013-02-01    2
2014-04-01    4
2014-11-01    4
2015-05-01    4
Name: key, dtype: int64

Details:

  • Filter df2 using query to include only the rows where EmploymentType and EmpStatus equal Employee and Active respectively.
  • Assign a dummy key to each dataframe and merge on the dummy key to create a cartesian join of all records.
  • Use query to filter the join results to only those records where EmpStartDate is less than or equal to StartDate.
  • Lastly, groupby StartDate and count.
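
For reference, the steps above can be sketched end to end on the question's sample data. This is only a minimal sketch, assuming pandas 1.2 or later, where merge accepts how="cross" and the dummy key becomes unnecessary. The names active, pairs and counts are illustrative, and the final reindex is an addition not present in the answer: it keeps df1's row order and reports 0 for a start date that matches nothing.

```python
import pandas as pd

# Sample data copied from the question; dates are parsed as month/day/year
# so that <= compares them chronologically rather than as strings.
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# Step 1: keep only active employees.
active = df2.query('EmploymentType == "Employee" and EmpStatus == "Active"')

# Step 2: cartesian product of every StartDate with every remaining row.
pairs = df1.merge(active, how="cross")

# Steps 3 and 4: keep pairs satisfying the date condition, then count per StartDate.
counts = (pairs[pairs["EmpStartDate"] <= pairs["StartDate"]]
          .groupby("StartDate")
          .size()
          .reindex(df1["StartDate"], fill_value=0))
print(counts)
```

The intermediate frame has len(df1) * len(active) rows, which is why this approach only suits small inputs.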

Also, note that using query is a shortcut. If your column names contain special characters or spaces, you'll need to filter your dataframes using boolean indexing.

Option #2:
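
As an illustration of the boolean-indexing alternative, here is a small sketch. The frame and its space-containing column names ("Employment Type", "Emp Status") are hypothetical and chosen only to show a case where a plain query expression is awkward to write.

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces.
df2 = pd.DataFrame({
    "Employment Type": ["Employee", "Employee", "Contractor"],
    "Emp Status": ["Active", "Terminated", "Active"],
})

# Each comparison yields a boolean Series. Combine them with & and wrap
# each comparison in parentheses, because & binds tighter than ==.
mask = (df2["Employment Type"] == "Employee") & (df2["Emp Status"] == "Active")
active = df2[mask]
print(active)
```

Recent pandas versions also accept backtick-quoted column names inside query, which may be enough for names that only contain spaces.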

pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').sort_values('EmpStartDate'), 
              df1.sort_values('StartDate'), 
              left_on='EmpStartDate', 
              right_on='StartDate', 
              direction='forward')\
  .groupby('StartDate')['EmploymentType'].count()\
  .reindex(df1.StartDate.sort_values())\
  .cumsum()\
  .ffill()

Output:

StartDate
2013-01-01    2.0
2013-02-01    2.0
2014-04-01    4.0
2014-11-01    4.0
2015-05-01    4.0
Name: EmploymentType, dtype: float64

Details:

  • Use pd.merge_asof to join the filtered df2 to df1, matching each EmpStartDate to the nearest StartDate on or after it (direction='forward').
  • groupby the StartDate joined on from df1 and count.
  • reindex the results by df1.StartDate to add the start dates that received no matches (they appear as NaN).
  • Use cumsum to turn the per-date counts into running totals, which mimics the <= condition.
  • Use ffill to populate the missing records with the previous running total.
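
The running-total idea above (sort, then count everything at or before each date) can also be expressed without any merge. The sketch below uses NumPy's searchsorted, a different technique from the merge_asof answer. It assumes both columns are already datetimes and that df2 has already been filtered to active employees; the names emp_start, sorted_starts and counts are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})

# EmpStartDate values of the active employees from the question's DF2.
emp_start = pd.to_datetime(
    ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013"],
    format="%m/%d/%Y")

# Sort once; side="right" then returns, for each StartDate, how many
# sorted values are <= that date.
sorted_starts = np.sort(emp_start.to_numpy())
counts = np.searchsorted(sorted_starts, df1["StartDate"].to_numpy(), side="right")

df1["Count"] = counts
print(df1)
```

This keeps df1 in its original row order and never materialises pairs of rows, so memory stays proportional to the two inputs.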

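# Assumes DF1['StartDate'] and DF2['EmpStartDate'] are already datetimes.
def compensation(x):
    # Count active employees in DF2 who started on or before date x.
    mask = ((DF2['EmploymentType'] == 'Employee')
            & (DF2['EmpStatus'] == 'Active')
            & (DF2['EmpStartDate'] <= x))
    return mask.sum()

# Series.apply calls the function once per StartDate (it takes no axis argument).
DF1['Count'] = DF1['StartDate'].apply(compensation)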

The method is Boolean indexing and counting rows: https://pandas.pydata.org/pandas-docs/stable/indexing.html
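
The same boolean-comparison-and-count idea can be vectorised with NumPy broadcasting instead of a per-row function call. This is a sketch of an alternative rather than the answer's method: it assumes the dates are already parsed and DF2 is already filtered to active employees, and the names start, emp and on_or_before are illustrative.

```python
import pandas as pd

# StartDate values from DF1 and EmpStartDate values of the active employees.
start = pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y").to_numpy()
emp = pd.to_datetime(
    ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013"],
    format="%m/%d/%Y").to_numpy()

# Boolean matrix of shape (len(start), len(emp)): entry [i, j] is True
# when employee j started on or before start date i.
on_or_before = emp[None, :] <= start[:, None]

# Counting the True values in each row gives one count per start date.
counts = on_or_before.sum(axis=1)
print(counts)
```

The boolean matrix holds len(start) * len(emp) entries, so this trades memory for speed in much the same way a cross join does, only without building a merged dataframe.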
