
Merge pandas Data Frames based on conditions

I have two files which contain information about transactions on products.

Operations of type 1

d_op_1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3],'cost':[10,20,20,20,10,20,20,20],
                       'date':[2000,2006,2012,2000,2009,2009,2002,2006]})


Operations of type 2

d_op_2 = pd.DataFrame({'id':[1,1,2,2,3,4,5,5],'cost':[3000,3100,3200,4000,4200,3400,2000,2500],
                       'date':[2010,2015,2008,2010,2006,2010,1990,2000]})


I want to keep only those records where there has been an operation of type 1 between two operations of type 2. E.g. for the product with id "1" there was an operation of type 1 (2012) between two operations of type 2 (2010, 2015), so I want to keep that record.
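To state the rule precisely: for each id, the earliest and latest operations of type 2 define a date window, and a type-1 operation qualifies only if its date lies strictly inside that window (strictly, going by the id "1" example above). A rough sketch of that check, where window is just an illustrative name:

# per-id window spanned by the type-2 operations
window = d_op_2.groupby('id')['date'].agg(['min', 'max'])

# type-1 rows whose date falls strictly inside the window of their id
flagged = d_op_1.join(window, on='id')
inside = (flagged['date'] > flagged['min']) & (flagged['date'] < flagged['max'])
print(flagged[inside])   # the id 1 row at 2012 and the id 2 rows at 2009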

The desired output could be either this:

(screenshot of the first desired output)

or this:

(screenshot of the second desired output)

Using pd.merge() I got this result:

(screenshot of the pd.merge() result)

How can I filter this to get the desired output?

You can use:

#concat both DataFrames, keeping the two cost columns separate
df4 = pd.concat([d_op_1.rename(columns={'cost':'cost1'}), 
                 d_op_2.rename(columns={'cost':'cost2'})]).fillna(0).astype(int)

#print (df4)

#find min and max type-2 dates per group
df3 = d_op_2.groupby('id')['date'].agg(start='min', end='max')
#print (df3)

#join min and max dates to the concatenated df
df = df4.join(df3, on='id')
#keep only rows whose date lies strictly between them
df = df[(df.date > df.start) & (df.date < df.end)]
#reshape so the kept dates and the surrounding type-2 dates form one column
df = pd.melt(df, 
             id_vars=['id','cost1'], 
             value_vars=['date','start','end'], 
             value_name='date_all')
#remove helper columns, drop duplicates and restore the name 'date'
df = df.drop(['cost1','variable'], axis=1) \
       .drop_duplicates() \
       .rename(columns={'date_all':'date'})
#merge back to the concatenated data, sort
df = pd.merge(df, df4, on=['id', 'date']) \
       .sort_values(['id','date']).reset_index(drop=True)
#reorder columns
df = df[['id','cost1','cost2','date']]
print(df)
   id  cost1  cost2  date
0   1      0   3000  2010
1   1     20      0  2012
2   1      0   3100  2015
3   2      0   3200  2008
4   2     10      0  2009
5   2     20      0  2009
6   2      0   4000  2010

#if lists are needed for the duplicated type-1 costs
df = df.groupby(['id','cost2', 'date'])['cost1'] \
       .apply(lambda x: list(x) if len(x) > 1 else x.values[0]) \
       .reset_index()
df = df[['id','cost1','cost2','date']]
print(df)
   id     cost1  cost2  date
0   1        20      0  2012
1   1         0   3000  2010
2   1         0   3100  2015
3   2  [10, 20]      0  2009
4   2         0   3200  2008
5   2         0   4000  2010
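
If you would rather avoid the melt/merge round trip, the same rows as in the first output can also be built directly from the per-id window. This is only an alternative sketch (window, t1_in and t2_edge are names made up here): keep the type-1 rows strictly inside the window and the two type-2 rows that bound it, for ids that have at least one qualifying type-1 row.

import pandas as pd

# earliest and latest type-2 date per id
window = d_op_2.groupby('id')['date'].agg(start='min', end='max')

# type-1 rows strictly inside their id's window
t1 = d_op_1.join(window, on='id')
t1_in = t1[(t1['date'] > t1['start']) & (t1['date'] < t1['end'])]

# the two type-2 rows bounding the window, only for ids with a qualifying type-1 row
t2 = d_op_2.join(window, on='id')
t2_edge = t2[t2['id'].isin(t1_in['id']) &
             (t2['date'].eq(t2['start']) | t2['date'].eq(t2['end']))]

# stack both parts into the same column layout as the first output above
out = pd.concat([t1_in.rename(columns={'cost': 'cost1'}).assign(cost2=0),
                 t2_edge.rename(columns={'cost': 'cost2'}).assign(cost1=0)])
out = (out[['id', 'cost1', 'cost2', 'date']]
       .sort_values(['id', 'date'])
       .reset_index(drop=True))
print(out)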
