Compare columns of two dataframes without merging the dataframes

Question

I have two dataframes, DF1 and DF2.

DF1:

StartDate

1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015

DF2:

EmploymentType        EmpStatus           EmpStartDate

Employee              Active              11/5/2012
Employee              Active              9/10/2012
Employee              Active              10/15/2013
Employee              Active              10/29/2013
Employee              Terminated          10/29/2013
Contractor            Terminated          11/20/2014
Contractor            Active              11/20/2014

I want the count of rows from DF2 where EmploymentType = 'Employee', EmpStatus = 'Active', and EmpStartDate <= the StartDate of DF1.

Output:

Start Date    Count

1/1/2013      2
2/1/2013      2
11/1/2014     4
4/1/2014      4
5/1/2015      4

How do I achieve this without merging the two dataframes?

I cannot merge the dataframes because they have no common keys. Since I need the row count based on conditions, I also cannot join them on a temporary column, because I need to avoid a cross join.

Answer 1

You can do it using a cartesian join and filtering, provided your dataframes are not too big (the intermediate result has len(df1) * len(df2) rows):

(df1.assign(key=1)
   .merge(df2.query('EmploymentType == "Employee" and EmpStatus=="Active"').assign(key=1), 
          on='key')
   .query('EmpStartDate <= StartDate')
   .groupby('StartDate')['key'].count())

Output:

StartDate
2013-01-01    2
2013-02-01    2
2014-04-01    4
2014-11-01    4
2015-05-01    4
Name: key, dtype: int64

Details:

Filter df2 using query to keep only the rows where EmploymentType and EmpStatus equal Employee and Active, respectively.
Assign a dummy key to each dataframe and merge on that key to create a cartesian join of all records.
Use query to filter the join results down to the records where EmpStartDate is less than or equal to StartDate.
Lastly, groupby StartDate and count.
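The steps above can be tried end to end on the sample data from the question. This is only a sketch: it assumes pandas 1.2 or later, where merge accepts how="cross" and so the dummy key is not needed, and it assumes the dates are month/day/year. The final reindex is an addition of mine so that a StartDate with no matching rows would show 0 instead of disappearing.

```python
import pandas as pd

# Sample data from the question; dates assumed to be month/day/year.
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"], format="%m/%d/%Y"),
})

# Step 1: keep only active employees.
active = df2.query('EmploymentType == "Employee" and EmpStatus == "Active"')

# Steps 2-4: cartesian join, filter on the date condition, count per StartDate.
counts = (df1.merge(active, how="cross")        # requires pandas >= 1.2
             .query("EmpStartDate <= StartDate")
             .groupby("StartDate").size()
             # Restore df1's row order and give unmatched dates a count of 0.
             .reindex(df1["StartDate"], fill_value=0))

print(counts)
```

On this data the counts come out as 2, 2, 4, 4, 4 in DF1's original row order, matching the output requested in the question.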

Also, note that using query is a shortcut. If your column names contain special characters or spaces, you'll need to filter your dataframes using boolean indexing instead.
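As a small illustration of the boolean-indexing fallback, here is a sketch with made-up column names that contain spaces ("Employment Type" and "Emp Status" are hypothetical, not the names from the question). To my knowledge, newer pandas versions (0.25 and later) also let query refer to such columns with backticks, which is shown as a second variant.

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces.
df = pd.DataFrame({
    "Employment Type": ["Employee", "Employee", "Contractor"],
    "Emp Status": ["Active", "Terminated", "Active"],
})

# Boolean indexing: combine conditions with & and wrap each one in parentheses.
mask = (df["Employment Type"] == "Employee") & (df["Emp Status"] == "Active")
filtered = df[mask]

# Backtick quoting inside query (assumes pandas >= 0.25).
filtered_q = df.query('`Employment Type` == "Employee" and `Emp Status` == "Active"')

print(filtered)
```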

Option #2:

pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').sort_values('EmpStartDate'), 
              df1.sort_values('StartDate'), 
              left_on='EmpStartDate', 
              right_on='StartDate', 
              direction='forward')\
  .groupby('StartDate')['EmploymentType'].count()\
  .reindex(df1.StartDate.sort_values())\
  .cumsum()\
  .ffill()

Output:

StartDate
2013-01-01    2.0
2013-02-01    2.0
2014-04-01    4.0
2014-11-01    4.0
2015-05-01    4.0
Name: EmploymentType, dtype: float64

Details:

Use pd.merge_asof to join the filtered df2 to df1 on the nearest forward-looking date.
groupby the start date joined from df1 and count.
reindex the results by df1.StartDate to fill in missing values for the start dates that had no match.
Use cumsum to mimic the <= condition as a running total.
Use ffill to populate the missing records with the previous sums.
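To see what each step contributes, the pipeline can be split into intermediate results on the question's sample data. This is a sketch under a few assumptions: dates are month/day/year, and I use reindex with fill_value=0 in place of the NaN-then-ffill step, a small variation that keeps the counts as integers and gives 0 (rather than NaN) to any StartDate that precedes every employee start date.

```python
import pandas as pd

# Sample data from the question; dates assumed to be month/day/year.
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"], format="%m/%d/%Y"),
})

# merge_asof needs both sides sorted on their join keys.
active = (df2.query('EmploymentType == "Employee" and EmpStatus == "Active"')
             .sort_values("EmpStartDate"))
starts = df1["StartDate"].sort_values()

# Each employee row is attached to the first StartDate on or after its start.
matched = pd.merge_asof(active, starts.to_frame(),
                        left_on="EmpStartDate", right_on="StartDate",
                        direction="forward")

# Employees that first become visible at each StartDate.
per_date = matched.groupby("StartDate").size()

# Fill unmatched dates with 0, then accumulate to get the "<=" count.
running = per_date.reindex(pd.Index(starts), fill_value=0).cumsum()

print(per_date)
print(running)
```

Here per_date has entries only for 2013-01-01 and 2014-04-01 (2 each), and the running total over the sorted start dates is 2, 2, 4, 4, 4.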

Answer 2

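def compensation(x):
    # Count DF2 rows for active employees who started on or before x.
    # Assumes both date columns are already datetime64 (e.g. via pd.to_datetime).
    mask = ((DF2['EmploymentType'] == 'Employee')
            & (DF2['EmpStatus'] == 'Active')
            & (DF2['EmpStartDate'] <= x))
    return mask.sum()

# Series.apply passes one StartDate at a time; no axis argument is needed.
DF1['Count'] = DF1['StartDate'].apply(compensation)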

The method is boolean indexing and counting rows. See https://pandas.pydata.org/pandas-docs/stable/indexing.html
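For larger frames, the same filter-and-count idea can be done without a Python-level loop and still without any merge. The sketch below swaps in numpy.searchsorted, which is not what the answer above uses: after sorting the qualifying EmpStartDate values once, side="right" returns, for each StartDate, how many sorted values are less than or equal to it. It assumes the sample data from the question with month/day/year dates.

```python
import numpy as np
import pandas as pd

# Sample data from the question; dates assumed to be month/day/year.
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"], format="%m/%d/%Y"),
})

# Boolean indexing: keep only active employees, then sort their start dates.
mask = (df2["EmploymentType"] == "Employee") & (df2["EmpStatus"] == "Active")
emp_dates = np.sort(df2.loc[mask, "EmpStartDate"].to_numpy())

# side="right" gives the number of sorted dates <= each StartDate.
df1["Count"] = np.searchsorted(emp_dates, df1["StartDate"].to_numpy(),
                               side="right")

print(df1)
```

The Count column comes out as 2, 2, 4, 4, 4 in DF1's original row order. Sorting costs O(m log m) once and each lookup is O(log m), versus a full scan of DF2 per row with apply.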

Answer 1: score 1, accepted, 2018-10-24 18:32:13
Answer 2: score 0