
Compare columns of two dataframes without merging the dataframes

I have two dataframes DF1 and DF2.

DF1:

StartDate

1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015

DF2:

EmploymentType        EmpStatus           EmpStartDate

Employee              Active              11/5/2012
Employee              Active              9/10/2012
Employee              Active              10/15/2013
Employee              Active              10/29/2013
Employee              Terminated          10/29/2013
Contractor            Terminated          11/20/2014
Contractor            Active              11/20/2014

I want the count of rows from DF2 where EmploymentType = 'Employee' and EmpStatus = 'Active' and EmpStartDate <= StartDate of DF1.

Output:

StartDate     Count

1/1/2013      2
2/1/2013      2
11/1/2014     4
4/1/2014      4
5/1/2015      4

How do I achieve this without merging the two dataframes?

I cannot merge the dataframes since there are no common keys, and because I need the count of rows based on conditions, I cannot join on a temporary key column either, as I need to avoid a cross join.
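For reference, a minimal sketch of how the sample data could be set up (assuming the date columns should be parsed as real datetimes so they compare chronologically):

import pandas as pd

# Reconstruction of the sample frames above; dates are parsed so that
# chronological comparison (rather than string comparison) works.
DF1 = pd.DataFrame({'StartDate': pd.to_datetime(
    ['1/1/2013', '2/1/2013', '11/1/2014', '4/1/2014', '5/1/2015'])})

DF2 = pd.DataFrame({
    'EmploymentType': ['Employee'] * 5 + ['Contractor'] * 2,
    'EmpStatus': ['Active', 'Active', 'Active', 'Active',
                  'Terminated', 'Terminated', 'Active'],
    'EmpStartDate': pd.to_datetime(
        ['11/5/2012', '9/10/2012', '10/15/2013', '10/29/2013',
         '10/29/2013', '11/20/2014', '11/20/2014'])})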

You can do it using a cartesian join and filtering, provided your dataframes aren't too big:

(df1.assign(key=1)                                  # dummy key on both frames
   .merge(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"')
             .assign(key=1),
          on='key')                                 # cartesian join on the dummy key
   .query('EmpStartDate <= StartDate')              # keep rows meeting the date condition
   .groupby('StartDate')['key'].count())

Output:

StartDate
2013-01-01    2
2013-02-01    2
2014-04-01    4
2014-11-01    4
2015-05-01    4
Name: key, dtype: int64

Details:

  • Filter df2 with query, keeping only rows where EmploymentType is 'Employee' and EmpStatus is 'Active'.
  • Assign a dummy key to each dataframe and merge on that key to create a cartesian join of all records.
  • Use query again to filter the joined result to records where EmpStartDate is less than or equal to StartDate.
  • Lastly, group by StartDate and count.

Also, note that using query is a shortcut. If your column names contain special characters or spaces, you'll need to filter your dataframes using boolean indexing instead, as sketched below.
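For example, a rough equivalent of the pipeline above written with boolean masks instead of query (and, assuming pandas 1.2 or later, how='cross' in place of the dummy key):

# Same logic as above, written with boolean masks; column names with spaces
# or special characters work fine here because no query string is parsed.
mask = (df2['EmploymentType'] == 'Employee') & (df2['EmpStatus'] == 'Active')
joined = df1.merge(df2[mask], how='cross')                 # pandas >= 1.2
joined = joined[joined['EmpStartDate'] <= joined['StartDate']]
counts = joined.groupby('StartDate')['EmpStartDate'].count()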

Option #2:

(pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"')
                 .sort_values('EmpStartDate'),
               df1.sort_values('StartDate'),
               left_on='EmpStartDate',
               right_on='StartDate',
               direction='forward')           # match each EmpStartDate to the next StartDate
   .groupby('StartDate')['EmploymentType'].count()
   .reindex(df1.StartDate.sort_values())      # bring back StartDates with no matches as NaN
   .cumsum()                                  # accumulate counts across dates (mimics <=)
   .ffill())                                  # carry counts forward into unmatched dates

Output:

StartDate
2013-01-01    2.0
2013-02-01    2.0
2014-04-01    4.0
2014-11-01    4.0
2015-05-01    4.0
Name: EmploymentType, dtype: float64

Details:

  • Use pd.merge_asof to join the filtered df2 to df1, matching each EmpStartDate to the nearest StartDate on or after it (direction='forward').
  • Group by the StartDate joined in from df1 and count.
  • Reindex the result by df1.StartDate so that start dates with no matches appear (as NaN).
  • Use cumsum so each StartDate accumulates the counts of all earlier dates, mimicking the <= condition.
  • Use ffill to carry the previous cumulative sum into start dates with no new matches.
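As a small follow-up, the reindex step introduces NaN, which is why the counts come back as floats; if integer counts are preferred, something like this can be appended (counts is an assumed name for the result of the chain above):

# fillna(0) guards the case where the earliest StartDate has no matches,
# then cast the cumulative counts back to integers.
counts = counts.fillna(0).astype(int)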

def compensation(x):
    # Count rows in DF2 that are Employee, Active, and started on or before x
    return DF2[(DF2['EmploymentType'] == 'Employee')
               & (DF2['EmpStatus'] == 'Active')
               & (DF2['EmpStartDate'] <= x)].shape[0]

DF1['Count'] = DF1['StartDate'].apply(compensation)

The method is boolean indexing and counting the matching rows: https://pandas.pydata.org/pandas-docs/stable/indexing.html
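A related merge-free sketch (assuming both date columns have already been parsed to datetimes): sort the filtered employee start dates once and let numpy's searchsorted count how many fall on or before each StartDate.

import numpy as np

# For every StartDate, count how many Employee/Active EmpStartDates are <= it.
emp_dates = np.sort(DF2.loc[(DF2['EmploymentType'] == 'Employee')
                            & (DF2['EmpStatus'] == 'Active'),
                            'EmpStartDate'].values)
DF1['Count'] = np.searchsorted(emp_dates, DF1['StartDate'].values, side='right')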
