
How to join pandas dataframe on 2 columns?

Assume the following DataFrames

df1:

id    data1
1     10
2     200
3     3000
4     40000

df2:

id1    id2    data2
1      2      210
1      3      3010
1      4      40010
2      3      3200
2      4      40200
3      4      43000

I want the new df3:

id1    id2    data2    data11    data12        
1      2      210      10        200
1      3      3010     10        3000
1      4      40010    10        40000 
2      3      3200     200       3000
2      4      40200    200       40000
3      4      43000    3000      40000

What is the correct way to achieve this in pandas?


Edit: Please note that the specific data can be arbitrary. I chose this data only to show where each value comes from; no data element is correlated with any other.


Another example with different DataFrames, because the first one wasn't clear enough:

df4:

id    data1
1     a
2     b
3     c
4     d

df5:

id1    id2    data2
1      2      e
1      3      f
1      4      g
2      3      h
2      4      i
3      4      j

I want the new df6:

id1    id2    data2    data11    data12        
1      2      e        a         b  
1      3      f        a         c
1      4      g        a         d
2      3      h        b         c
2      4      i        b         d
3      4      j        c         d

Edit 2: data11 and data12 are simply copies of data1, looked up by the corresponding id (id1 or id2).

1. First, merge df2 with df1 using the id1 and id columns.
2. Rename data1 to data11.
3. Drop the id column.
4. Now merge the result (df3) with df1 on id2 and id, then rename data1 to data12 and drop id again.

df3 = pd.merge(df2,df1,left_on=['id1'],right_on=['id'],how='left')
df3.rename(columns={'data1':'data11'},inplace=True)
df3.drop('id',axis=1,inplace=True)

df3 = pd.merge(df3,df1,left_on=['id2'],right_on=['id'],how='left')
df3.rename(columns={'data1':'data12'},inplace=True)
df3.drop('id',axis=1,inplace=True)

I hope this solves your problem.
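For reference, the four steps can also be written as one runnable script. This is only a sketch built on the sample data from the question. The `validate="many_to_one"` argument is an addition that is not part of the answer above; it asks pandas to raise an error if `id` is ever duplicated in df1, which would otherwise silently multiply rows.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "data1": [10, 200, 3000, 40000]})
df2 = pd.DataFrame({
    "id1": [1, 1, 1, 2, 2, 3],
    "id2": [2, 3, 4, 3, 4, 4],
    "data2": [210, 3010, 40010, 3200, 40200, 43000],
})

# Steps 1-3: look up data1 for id1, rename it, drop the helper key column.
step1 = (
    df2.merge(df1, left_on="id1", right_on="id", how="left", validate="many_to_one")
       .rename(columns={"data1": "data11"})
       .drop(columns="id")
)

# Step 4: repeat the same lookup for id2.
df3 = (
    step1.merge(df1, left_on="id2", right_on="id", how="left", validate="many_to_one")
         .rename(columns={"data1": "data12"})
         .drop(columns="id")
)

print(df3)
```

Because both merges use how="left", the rows stay in the same order as in df2.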

Try this:

# merge dataframes, first on id and id1 then on id2
df3 = pd.merge(df1, df2, left_on="id", right_on="id1", how="inner")
df3 = pd.merge(df1, df3, left_on="id", right_on="id2", how="inner")

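# rename and reorder columns
# (data1_y comes from the first merge on id1, data1_x from the second merge on id2)
cols = [ 'id1', 'id2', 'data2', 'data1_y', 'data1_x']
df3 = df3[cols].copy()

new_cols = ["id1", "id2", "data2", "data11", "data12"]
df3.columns = new_cols

# sort on both keys and rebuild the index so the rows are numbered 0..5
df3 = df3.sort_values(["id1", "id2"]).reset_index(drop=True)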

print(df3)

This prints out:

    id1 id2 data2   data11  data12
0   1   2   210     10      200
1   1   3   3010    10      3000
2   1   4   40010   10      40000
3   2   3   3200    200     3000
4   2   4   40200   200     40000
5   3   4   43000   3000    40000
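The names data1_x and data1_y in the code above come from pandas' default `suffixes=("_x", "_y")`, which are appended to overlapping column names. As a hedged alternative, the suffixes can be chosen explicitly so that the second merge produces data11 and data12 directly. The sketch below assumes the sample data from the question and uses how="left" rather than how="inner" so that the row order of df2 is kept.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "data1": [10, 200, 3000, 40000]})
df2 = pd.DataFrame({
    "id1": [1, 1, 1, 2, 2, 3],
    "id2": [2, 3, 4, 3, 4, 4],
    "data2": [210, 3010, 40010, 3200, 40200, 43000],
})

# First lookup: no column names clash yet, so data1 keeps its name.
step1 = df2.merge(df1, left_on="id1", right_on="id", how="left").drop(columns="id")

# Second lookup: data1 now exists on both sides, so the suffixes decide the names.
# The left copy (matched on id1) becomes data11, the right copy (matched on id2) data12.
df3 = step1.merge(
    df1, left_on="id2", right_on="id", how="left", suffixes=("1", "2")
).drop(columns="id")

print(df3)
```

The helper column id has to be dropped after the first merge; otherwise the suffix "1" would turn it into a second column called id1.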

One solution to your problem is:

import pandas as pd

data1 = {'id' : [1,2,3,4],
         'data1' : [10,200,3000,40000]}

data2 = {'id1' : [1,1,1,2,2,3],
         'id2' : [2,3,4,3,4,4],
         'data2' : [210,3010,40010,3200,40200,43000]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df1:
id    data1
1     10
2     200
3     3000
4     40000

df2:
id1    id2    data2
1      2      210
1      3      3010
1      4      40010
2      3      3200
2      4      40200
3      4      43000

df3 = df2.set_index('id1').join(df1.set_index('id'))
df3.index.names = ['id1']
df3.reset_index(inplace=True)

final = df3.set_index('id2').join(df1.set_index('id'), rsuffix='2')
final.index.names = ['id2']
final.reset_index(inplace=True)

final[['id1','id2','data2','data1','data12']].sort_values('id1')

output df: 

id1 id2 data2   data1   data12
 1   2    210    10     200
 1   3    3010   10     3000
 1   4    40010  10     40000
 2   3    3200   200    3000
 2   4    40200  200    40000
 3   4    43000  3000   40000

I hope this will help you.
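Joining on an index is essentially a lookup by key. If id is unique in df1, the same lookup can be expressed without join or merge by using `Series.map`. Note that this is a different technique from the one in the answer above, shown only as a sketch; the variable name `lookup` is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "data1": [10, 200, 3000, 40000]})
df2 = pd.DataFrame({
    "id1": [1, 1, 1, 2, 2, 3],
    "id2": [2, 3, 4, 3, 4, 4],
    "data2": [210, 3010, 40010, 3200, 40200, 43000],
})

# Series indexed by id; this only works as a lookup table if id is unique in df1.
lookup = df1.set_index("id")["data1"]

# Map each id column through the lookup to create the two new columns.
df3 = df2.assign(
    data11=df2["id1"].map(lookup),
    data12=df2["id2"].map(lookup),
)

print(df3)
```

Ids in df2 that do not appear in df1 would come out as NaN, which matches the behaviour of a left join.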

Using merge in a for loop with range and f-string

One way to generalise this, and to make it easier to extend when there are more than two id columns, is to use a list comprehension over range with f-strings.

After that, we concatenate the results side by side and drop the duplicated columns:

dfs = [df2.merge(df1, 
                 left_on=f'id{x+1}', 
                 right_on='id', 
                 how='left').rename(columns={'data1':f'data1{x+1}'}) for x in range(2)]

df = pd.concat(dfs, axis=1).drop('id', axis=1)

df = df.loc[:, ~df.columns.duplicated()]

Output

   id1  id2  data2  data11  data12
0    1    2    210      10     200
1    1    3   3010      10    3000
2    1    4  40010      10   40000
3    2    3   3200     200    3000
4    2    4  40200     200   40000
5    3    4  43000    3000   40000
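The concat step above relies on every merged frame having its rows in the same order. A possible variation, sketched below, merges sequentially inside a plain loop so that no duplicated columns have to be removed afterwards. The function `add_lookups` and its parameters are hypothetical names introduced for this example, and the data is the second example (df4, df5) from the question.

```python
import pandas as pd


def add_lookups(pairs, lookup, id_cols, key="id", value="data1"):
    """Hypothetical helper: append one `value` column per entry in id_cols."""
    out = pairs
    for n, col in enumerate(id_cols, start=1):
        new_name = f"{value}{n}"
        # Rename the lookup table so it joins on `col` and lands as e.g. data11.
        renamed = lookup.rename(columns={key: col, value: new_name})
        out = out.merge(renamed[[col, new_name]], on=col, how="left")
    return out


df4 = pd.DataFrame({"id": [1, 2, 3, 4], "data1": list("abcd")})
df5 = pd.DataFrame({
    "id1": [1, 1, 1, 2, 2, 3],
    "id2": [2, 3, 4, 3, 4, 4],
    "data2": list("efghij"),
})

df6 = add_lookups(df5, df4, ["id1", "id2"])
print(df6)
```

Adding a third id column would then only require extending the list passed as id_cols.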

As @tawab_shakeel mentioned earlier, the key step is to merge the DataFrames on a particular column according to (SQL-style) join rules. To help you understand the different ways of merging on specific columns, here is a general guide.

[image: Joining Dataframes in Pandas]

[image: SQL Join Types]
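To make the join rules concrete, here is a small sketch that runs the same merge with each value of `how`. The two frames are made up for illustration (they are not the question's data) so that some keys have no partner on the other side; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Illustrative frames: key 5 exists only on the left, key 3 only on the right.
left = pd.DataFrame({"id1": [1, 2, 5], "data2": ["e", "h", "z"]})
right = pd.DataFrame({"id": [1, 2, 3], "data1": ["a", "b", "c"]})

counts = {}
for how in ("inner", "left", "right", "outer"):
    merged = left.merge(right, left_on="id1", right_on="id", how=how, indicator=True)
    counts[how] = len(merged)
    print(how, "->", len(merged), "rows")
    print(merged, end="\n\n")
```

With these frames, inner keeps only the two matching keys, left and right each keep three rows, and outer keeps all four distinct keys. For the question itself every id in df2 exists in df1, so left and inner return the same rows.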

Use two left merges on df2, one on column id1 and one on column id2:

import pandas as pd
from io import StringIO

txt="""id,data1
1,a
2,b
3,c
4,d
"""

f = StringIO(txt)
df1 = pd.read_table(f,sep =',')
df1['id']=df1['id'].astype(int)

txt="""id1,id2,data2
1,2,e
1,3,f
1,4,g
2,3,h
2,4,i
3,4,j
"""

f = StringIO(txt)
df2 = pd.read_table(f,sep =',')
df2['id1']=df2['id1'].astype(int)
df2['id2']=df2['id2'].astype(int)

left_on='id1'
right_on='id'
suffix='_1'
df2=df2.merge(df1, how='left', left_on=left_on, right_on=right_on, 
                  suffixes=("", suffix))

left_on='id2'
right_on='id'
suffix='_2'
df2=df2.merge(df1, how='left', left_on=left_on, right_on=right_on, 
                  suffixes=("", suffix))

print(df2)

output

   id1  id2 data2  id data1  id_2 data1_2
0    1    2     e   1     a     2       b
1    1    3     f   1     a     3       c
2    1    4     g   1     a     4       d
3    2    3     h   2     b     3       c
4    2    4     i   2     b     4       d
5    3    4     j   3     c     4       d
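The printed frame still carries the helper columns id and id_2, and it uses the names data1 and data1_2 rather than data11 and data12 from the question. Note also that the first merge has no overlapping column names, so suffix='_1' is never applied. A minimal sketch of the remaining cleanup, assuming the same sample data (read here with `pd.read_csv`, which is a substitution for `pd.read_table`):

```python
import pandas as pd
from io import StringIO

df1 = pd.read_csv(StringIO("id,data1\n1,a\n2,b\n3,c\n4,d\n"))
df2 = pd.read_csv(StringIO("id1,id2,data2\n1,2,e\n1,3,f\n1,4,g\n2,3,h\n2,4,i\n3,4,j\n"))

# Same two left merges as above; only the second one has clashing names to suffix.
merged = (
    df2.merge(df1, how="left", left_on="id1", right_on="id")
       .merge(df1, how="left", left_on="id2", right_on="id", suffixes=("", "_2"))
)

# Drop the helper key columns and rename to the names asked for in the question.
df6 = (
    merged.drop(columns=["id", "id_2"])
          .rename(columns={"data1": "data11", "data1_2": "data12"})
)

print(df6)
```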
