Pandas: DataFrame too long after merge

Say I have two DataFrames, one longer than the other, that I want to join on a specific column, as in the following example:

A = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10], 'col3': [11, 12, 13, 14, 15]})

B = pd.DataFrame({'col1': [1, 3, 5], 'col2': [16, 17, 18], 'col4': [19, 20, 21]})

Then I join them with:

pd.merge(A, B, on='col1', how='outer')

And get, as expected:

       col1     col2_x  col3    col2_y  col4
0       1       6       11      16      19
1       2       7       12      NaN     NaN
2       3       8       13      17      20
3       4       9       14      NaN     NaN
4       5       10      15      18      21

5 rows × 5 columns

However, I have two DataFrames that I'm trying to merge, with 28,011 and 15,676 rows, respectively. Merging them the same way as above, I would expect to get back a DataFrame with 28,011 rows and NaN in those cells where df2 had no observations. What happens instead is this:

len(pd.merge(df1, df2, on='col1', how='outer'))
  51881

How is this possible? The column I'm merging on is a unique identifier, and the same operation goes through without problems in Stata. What am I missing here?

Sounds like you want a left join.

Try:

pd.merge(df1, df2, on='col1', how='left')
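One caveat: a left join only returns exactly len(df1) rows if col1 is unique in df2. An outer join on keys that are unique in both frames can produce at most 28,011 + 15,676 = 43,687 rows, so a result of 51,881 suggests that col1 has repeated values (pandas also matches NaN keys against each other). Below is a minimal sketch of how one might check this. The two small frames and the columns 'a' and 'b' are made-up stand-ins for df1 and df2, not the asker's data.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2; 'col1' repeats in both frames.
df1 = pd.DataFrame({'col1': [1, 2, 2, 3], 'a': [10, 20, 21, 30]})
df2 = pd.DataFrame({'col1': [2, 2, 3, 4], 'b': [200, 201, 300, 400]})

# Count rows whose key value appears more than once in each frame.
dup_left = int(df1['col1'].duplicated(keep=False).sum())
dup_right = int(df2['col1'].duplicated(keep=False).sum())

# A repeated key yields (matches on the left) x (matches on the right) rows.
# indicator=True adds a '_merge' column showing where each row came from.
outer = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
left = pd.merge(df1, df2, on='col1', how='left')

# validate= makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(df1, df2, on='col1', how='left', validate='one_to_one')
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

# With duplicate keys dropped on the right, the left join keeps len(df1) rows.
deduped = pd.merge(df1, df2.drop_duplicates('col1'), on='col1', how='left')

print(dup_left, dup_right, len(outer), len(left), keys_unique, len(deduped))
```

If the duplicate counts are non-zero on the real data, whether to drop, aggregate, or keep the repeated keys depends on what they represent; drop_duplicates here simply keeps the first occurrence.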
