Explain how pandas DataFrame join works

Question

Why does inner join behave so strangely in pandas?

For example:

import pandas as pd
import io

t1 = ('key,col1\n'
      '1,a\n'
      '2,b\n'
      '3,c\n'
      '4,d')

t2 = ('key,col2\n'
      '1,e\n'
      '2,f\n'
      '3,g\n'
      '4,h')


df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)

print(df1)
print()
print(df2)
print()
print(df2.join(df1, on='key', how='inner', lsuffix='_l'))

Outputs:

   key col1
0    1    a
1    2    b
2    3    c
3    4    d

   key col2
0    1    e
1    2    f
2    3    g
3    4    h

   key_l col2  key col1
0      1    e    2    b
1      2    f    3    c
2      3    g    4    d

If I don't specify lsuffix, it says

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')

Does this function work differently from SQL's JOIN? Why does it want to create an extra 'key' column with a suffix? Why are there only 3 rows? I expected it to output something like this:

   key col1 col2
0    1    a    e
1    2    b    f
2    3    c    g
3    4    d    h

Answer 1

First things first:
What you wanted was merge

df1.merge(df2)

join defaults to merging on the index. You can specify the on parameter, but it only says which column from the left side to match against the index of the right side.

These might help illustrate

df1.set_index('key').join(df2.set_index('key'))

df1.join(df2.set_index('key'), on='key')
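As a minimal sketch, the three calls above can be run side by side on the frames from the question. The variable names merged, joined and joined_on are only illustrative, and how='inner' is spelled out to mirror the question; all three should give the four rows the question expected.

```python
import io
import pandas as pd

# Same data as in the question
df1 = pd.read_csv(io.StringIO('key,col1\n1,a\n2,b\n3,c\n4,d'))
df2 = pd.read_csv(io.StringIO('key,col2\n1,e\n2,f\n3,g\n4,h'))

# merge matches on the shared 'key' column, much like a SQL join on key
merged = df1.merge(df2, on='key', how='inner')

# join matches index to index, so both frames get 'key' as their index first
joined = df1.set_index('key').join(df2.set_index('key'), how='inner').reset_index()

# join with on= matches df1's 'key' column against the index of the right frame
joined_on = df1.join(df2.set_index('key'), on='key', how='inner')

print(merged)
print(joined)
print(joined_on)
```

In each case only one key column ends up in the result, so no suffix is needed.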

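Your example matches the key column of df2, which looks like [1, 2, 3, 4], with the index of df1, which looks like [0, 1, 2, 3]. Only the keys 1, 2 and 3 appear in that index, which is why the inner join returns three rows.
With the frames swapped, the same mismatch is why you get NaN in col2 when key_l is 4: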

df1.join(df2, on='key', lsuffix='_l', how='outer')
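To see where the three rows in the question come from, here is a small sketch that checks which values of df2's key column exist in df1's default index before running the question's own call. The names matched and result are only illustrative.

```python
import io
import pandas as pd

# Same data as in the question
df1 = pd.read_csv(io.StringIO('key,col1\n1,a\n2,b\n3,c\n4,d'))
df2 = pd.read_csv(io.StringIO('key,col2\n1,e\n2,f\n3,g\n4,h'))

# df2.join(df1, on='key') looks up each df2['key'] value in df1's index
print(df1.index.tolist())   # default index: [0, 1, 2, 3]
print(df2['key'].tolist())  # [1, 2, 3, 4]

# Only keys that are present in df1's index survive an inner join
matched = df2['key'].isin(df1.index)
print(matched.tolist())     # [True, True, True, False]

# The call from the question: key 4 has no partner, so three rows remain
result = df2.join(df1, on='key', how='inner', lsuffix='_l')
print(result)
```

Because df1's key column is not used for matching here, it stays an ordinary column and collides with df2's key, which is why pandas asks for a suffix.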

1 answer

solution1
2 ACCEPTED 2016-09-28 19:27:47
