Compare two pandas dataframe with different size

Question

I have one massive pandas dataframe with this structure:

And a second one, smaller like this:

I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G

I manage to do it with for loops, but the database is massive and the code run really slowly so I am looking for a Pandas-way or numpy to do it.

Many thanks,

Boris

Answer 1

If you only want to match mutual rows in both dataframes:

import pandas as pd

df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1    
   Name Special ability
0  Sara   Walk on water

df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11

df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4             NaN
1  Gustaf   12   Walk on water
2  Patrik   11             NaN

This Can allso be done with more than one matching argument: (In this example Patrik from df1 does not exist in df2 becuse they have different ages and therfore will not merge)

df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})

df1
     Name Special ability  Age
0    Sara   Walk on water   12
1  Patrik       FireBalls   83

df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11

df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
     Name  Age Special ability
0    Sara   12   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN

Answer 2

You probably want to use a merge:

df=df1.merge(df2,left_on="A",right_on="G")

will give you a dataframe with 3 columns, but the third one's name will be H

df.columns=["A","B","C"]

will then give you the column names you want

Answer 3

You can use map by Series created by set_index :

df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31

Or merge with drop and rename :

df = df1.merge(df2,left_on="A",right_on="G", how='left')
        .drop('G', axis=1)
        .rename(columns={'H':'C'})
print (df)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31

Answer 4

Here's one vectorized NumPy approach -

idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]

idx could be computed in a simpler way with : df2.G.searchsorted(df1.A) , but don't think that would be anymore efficient, because we want to use the underlying array with .values for performance as done earlier.

Compare two pandas dataframe with different size

Question

4 answers

solution1
4 ACCPTED 2020-08-06 07:41:52

solution2
2 ACCPTED 2017-06-07 14:05:02

solution3
0 2017-06-07 14:06:20

solution4
0 2017-06-07 14:10:55

Compare two pandas dataframe with different size

Question

4 answers

solution1 4 ACCPTED 2020-08-06 07:41:52

solution2 2 ACCPTED 2017-06-07 14:05:02

solution3 0 2017-06-07 14:06:20

solution4 0 2017-06-07 14:10:55

solution1
4 ACCPTED 2020-08-06 07:41:52

solution2
2 ACCPTED 2017-06-07 14:05:02

solution3
0 2017-06-07 14:06:20

solution4
0 2017-06-07 14:10:55