简体   繁体   中英

Explain how pandas DataFrame join works

Why does inner join work so strange in pandas?

For example:

import pandas as pd
import io

t1 = ('key,col1\n'
      '1,a\n'
      '2,b\n'
      '3,c\n'
      '4,d')

t2 = ('key,col2\n'
      '1,e\n'
      '2,f\n'
      '3,g\n'
      '4,h')


df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)

print(df1)
print()
print(df2)
print()
print(df2.join(df1, on='key', how='inner', lsuffix='_l'))

Outputs:

   key col1
0    1    a
1    2    b
2    3    c
3    4    d

   key col2
0    1    e
1    2    f
2    3    g
3    4    h

   key_l col2  key col1
0      1    e    2    b
1      2    f    3    c
2      3    g    4    d

If I don't specify lsuffix , it says

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')

Does this function work differently from SQL's JOIN? Why does it want to create an extra 'key' column with a suffix? Why are there only 3 rows? I expected it to output something like this:

   key col1 col2
0    1    a    e
1    2    b    f
2    3    c    g
3    4    d    h

First things first:
What you wanted was merge

df1.merge(df2)

在此处输入图片说明


join defaults to merging on the index . You can specify the on parameter which only says which column from left side to match with the index of the right side.

These might help illustrate

df1.set_index('key').join(df2.set_index('key'))

在此处输入图片说明

df1.join(df2.set_index('key'), on='key')

在此处输入图片说明


Your example is matching the index of df2 which looks like [0, 1, 2, 3] with the key column of df1 which looks like [1, 2, 3, 4]
That's why you get NaN in col2 when key_l is 4

df1.join(df2, on='key', lsuffix='_l', how='outer')

在此处输入图片说明

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM