Compare columns of two DataFrames without merging them

Question

I have two DataFrames, DF1 and DF2.

DF1:

StartDate

1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015

DF2:

EmploymentType        EmpStatus           EmpStartDate

Employee              Active              11/5/2012
Employee              Active              9/10/2012
Employee              Active              10/15/2013
Employee              Active              10/29/2013
Employee              Terminated          10/29/2013
Contractor            Terminated          11/20/2014
Contractor            Active              11/20/2014

For each StartDate in DF1, I want the number of rows in DF2 where EmploymentType = 'Employee', EmpStatus = 'Active', and EmpStartDate <= that StartDate.

Output:

Start Date    Count

1/1/2013      2
2/1/2013      2
11/1/2014     4
4/1/2014      4
5/1/2015      4

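How can I achieve this without merging the two DataFrames?

I cannot merge the DataFrames because they share no common key. Because I need to count rows based on a condition, I also cannot join them on a temporary column, since I need to avoid a cross join.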
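
For anyone who wants to reproduce the problem, the sample frames above can be built roughly as follows. This is only a sketch: the lowercase names df1 and df2 and the month/day/year reading of the dates are assumptions, since the question does not state the date format.

```python
import pandas as pd

# Sample data copied from the question (dates assumed to be month/day/year).
df1 = pd.DataFrame(
    {"StartDate": ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"]}
)
df2 = pd.DataFrame(
    {
        "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
        "EmpStatus": [
            "Active", "Active", "Active", "Active",
            "Terminated", "Terminated", "Active",
        ],
        "EmpStartDate": [
            "11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
            "10/29/2013", "11/20/2014", "11/20/2014",
        ],
    }
)

# Parse the strings so that <= compares dates rather than text.
df1["StartDate"] = pd.to_datetime(df1["StartDate"], format="%m/%d/%Y")
df2["EmpStartDate"] = pd.to_datetime(df2["EmpStartDate"], format="%m/%d/%Y")
```

Comparing the raw strings would sort "11/1/2014" before "2/1/2013", so the date parsing step matters for any of the approaches discussed here.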

Answer 1

If your DataFrames are not too large, you can do this with a Cartesian join followed by a filter:

(df1.assign(key=1)
   .merge(df2.query('EmploymentType == "Employee" and EmpStatus=="Active"').assign(key=1), 
          on='key')
   .query('EmpStartDate <= StartDate')
   .groupby('StartDate')['key'].count())

Output:

StartDate
2013-01-01    2
2013-02-01    2
2014-04-01    4
2014-11-01    4
2015-05-01    4
Name: key, dtype: int64

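Details:

Use query to filter df2 down to the rows where EmploymentType is Employee and EmpStatus is Active.
Assign a dummy key to each DataFrame and merge on that key to create a Cartesian join of all records.
Use query again to keep only the joined records where EmpStartDate is less than or equal to StartDate.
Finally, groupby StartDate and count.

Also note that query is a shortcut. If your column names contain special characters or spaces, you will need to filter the DataFrame with boolean indexing instead.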
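
As a rough illustration of that last point, the same pipeline can be written with boolean masks instead of query. The sample data is rebuilt here from the question so the snippet runs on its own, and the intermediate names active, joined and counts are made up for this sketch.

```python
import pandas as pd

# Sample data from the question, with dates already parsed.
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["2013-01-01", "2013-02-01", "2014-11-01", "2014-04-01", "2015-05-01"])})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["2012-11-05", "2012-09-10", "2013-10-15", "2013-10-29",
         "2013-10-29", "2014-11-20", "2014-11-20"]),
})

# Boolean-mask equivalent of the first query() call.
active = df2[(df2["EmploymentType"] == "Employee") & (df2["EmpStatus"] == "Active")]

# Cartesian join on a dummy key, as in the snippet above.
joined = df1.assign(key=1).merge(active.assign(key=1), on="key")

# Boolean-mask equivalent of the second query() call.
joined = joined[joined["EmpStartDate"] <= joined["StartDate"]]

counts = joined.groupby("StartDate")["key"].count()
print(counts)
```

Each condition needs its own parentheses because & binds more tightly than == in Python.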

Option #2:

pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').sort_values('EmpStartDate'), 
              df1.sort_values('StartDate'), 
              left_on='EmpStartDate', 
              right_on='StartDate', 
              direction='forward')\
  .groupby('StartDate')['EmploymentType'].count()\
  .reindex(df1.StartDate.sort_values())\
  .cumsum()\
  .ffill()

Output:

StartDate
2013-01-01    2.0
2013-02-01    2.0
2014-04-01    4.0
2014-11-01    4.0
2015-05-01    4.0
Name: EmploymentType, dtype: float64

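Details:

Use pd.merge_asof to join the filtered df2 to df1, matching each EmpStartDate to the nearest StartDate on or after it (direction='forward').
groupby the StartDate brought in from df1 and count.
reindex the result by df1.StartDate so that start dates with no matches appear as missing values.
Use cumsum to turn the per-date counts into running totals, which emulates the <= condition.
Use ffill to fill the remaining missing entries with the previous running total.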
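
The steps above can be sketched one at a time to see the intermediate values. The data is rebuilt from the question, and the names active, starts, matched, per_date, aligned and running are hypothetical, chosen only for this walkthrough.

```python
import pandas as pd

df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["2013-01-01", "2013-02-01", "2014-11-01", "2014-04-01", "2015-05-01"])})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["2012-11-05", "2012-09-10", "2013-10-15", "2013-10-29",
         "2013-10-29", "2014-11-20", "2014-11-20"]),
})

# merge_asof requires both inputs to be sorted on their join keys.
active = (df2.query('EmploymentType == "Employee" and EmpStatus == "Active"')
             .sort_values("EmpStartDate"))
starts = df1.sort_values("StartDate")

# Step 1: give each active employee the first StartDate on or after EmpStartDate.
matched = pd.merge_asof(active, starts,
                        left_on="EmpStartDate", right_on="StartDate",
                        direction="forward")

# Step 2: employees per matched StartDate (only dates with at least one match).
per_date = matched.groupby("StartDate")["EmploymentType"].count()

# Step 3: bring back the unmatched StartDates as NaN.
aligned = per_date.reindex(starts["StartDate"])

# Steps 4 and 5: running total, then carry it forward over the NaN gaps.
running = aligned.cumsum().ffill()
print(running)
```

One caveat with this sequence: if the earliest StartDate has no matching employees, its value stays NaN after ffill, so filling with 0 before cumsum may be preferable in that case.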

Answer 2

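# Assumes DF1['StartDate'] and DF2['EmpStartDate'] are already datetime columns.
def compensation(x):
    # Count active employees in DF2 who started on or before date x.
    mask = (
        (DF2['EmploymentType'] == 'Employee')
        & (DF2['EmpStatus'] == 'Active')
        & (DF2['EmpStartDate'] <= x)
    )
    return mask.sum()

# Series.apply calls the function once per StartDate; it takes no axis argument.
DF1['Count'] = DF1['StartDate'].apply(compensation)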

This approach uses boolean indexing and counts the matching rows. https://pandas.pydata.org/pandas-docs/stable/indexing.html
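
The row-by-row count scans DF2 once for every StartDate. A possible vectorized variant of the same idea, which is not part of either answer, is to sort the qualifying start dates once and use NumPy's searchsorted to count how many fall on or before each StartDate. The data is rebuilt from the question, and emp_starts is a name invented for this sketch.

```python
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["2013-01-01", "2013-02-01", "2014-11-01", "2014-04-01", "2015-05-01"])})
DF2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active", "Active", "Active", "Active",
                  "Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["2012-11-05", "2012-09-10", "2013-10-15", "2013-10-29",
         "2013-10-29", "2014-11-20", "2014-11-20"]),
})

# Boolean indexing: keep only active employees, then sort their start dates.
mask = (DF2["EmploymentType"] == "Employee") & (DF2["EmpStatus"] == "Active")
emp_starts = np.sort(DF2.loc[mask, "EmpStartDate"].to_numpy())

# side="right" gives, for each StartDate, the number of sorted values <= it.
DF1["Count"] = np.searchsorted(emp_starts, DF1["StartDate"].to_numpy(), side="right")
print(DF1)
```

Neither frame is merged here, and DF1 keeps its original row order, which matches the layout of the output requested in the question.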

Answer 1: 1, accepted, 2018-10-24 18:32:13
Answer 2: 0