
Why does pandas merge on NaN?

I recently asked a question about missing values in pandas and was directed to a GitHub issue. After reading through that issue and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan
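As a quick check of the comparison claim itself, the sketch below compares NaN with itself, both as a scalar and element-wise in a Series. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself (IEEE 754 floating-point rules).
nan_equals_itself = np.nan == np.nan
nan_differs_from_itself = np.nan != np.nan

# Element-wise comparison of a float Series with itself follows the same rule.
s = pd.Series([np.nan, 1.0])
elementwise = (s == s).tolist()

print(nan_equals_itself, nan_differs_from_itself, elementwise)
```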

import numpy as np
import pandas as pd

# merge example
df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')

    col1    col2    col3
0   NaN      1       3

# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))

      col2  col3
col1        
NaN     1   3.0
match   2   NaN

However, NaNs in groupby are excluded:

df = pd.DataFrame({'col1':[np.nan, 'match', np.nan], 'col2':[1,2,1]})
df.groupby('col1').sum()

       col2
col1    
match   2

Of course you can use dropna() or df[df['col1'].notnull()], but I am curious as to why NaNs are excluded in some pandas operations, like groupby, and not in others, like merge, join, update, and map.
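For reference, here is a minimal sketch of that workaround, assuming the goal is to stop NaN keys from matching in a merge. The second frame's key is changed to 'match' here purely so the inner join has a non-NaN row to keep. The last step shows the opposite choice for groupby, using the dropna=False argument, which as far as I know requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [np.nan, "match"], "col2": [1, 2]})
df2 = pd.DataFrame({"col1": [np.nan, "match"], "col3": [3, 4]})

# Default behaviour: the NaN keys are matched with each other.
merged_default = pd.merge(df, df2, on="col1")

# Workaround: drop rows whose key is NaN on both sides before merging.
merged_no_nan = pd.merge(
    df.dropna(subset=["col1"]),
    df2.dropna(subset=["col1"]),
    on="col1",
)

# Opposite choice for groupby: keep the NaN group instead of dropping it.
g = pd.DataFrame({"col1": [np.nan, "match", np.nan], "col2": [1, 2, 1]})
grouped_with_nan = g.groupby("col1", dropna=False).sum()

print(merged_default)
print(merged_no_nan)
print(grouped_with_nan)
```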

Essentially, as I asked above, why do merge and join match on np.nan when NaNs do not compare equal?

Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes that the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.

A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine which rows are to be matched with each other.

Specifically, this portion:

# NA group
lmask = llab == -1
lany = lmask.any()
rmask = rlab == -1
rany = rmask.any()

if lany or rany:
    if lany:
        np.putmask(llab, lmask, count)
    if rany:
        np.putmask(rlab, rmask, count)
    count += 1

...seems to be the culprit. NaN keys are identified as a valid category (with a categorical value equal to count).
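To illustrate the suspected mechanism, here is a simplified sketch that mimics the remapping with the public pd.factorize function. It is not the actual pandas internals, only an approximation under the assumption that both key columns are factorized against a shared set of unique values. The names llab, rlab, and count are borrowed from the quoted snippet; the rest is illustrative.

```python
import numpy as np
import pandas as pd

left_keys = np.array([np.nan, "match"], dtype=object)
right_keys = np.array([np.nan, "no match"], dtype=object)

# Factorize both sides together so equal keys share one integer label.
# pd.factorize marks missing values with the sentinel code -1.
codes, uniques = pd.factorize(np.concatenate([left_keys, right_keys]))
llab = codes[: len(left_keys)].copy()
rlab = codes[len(left_keys):].copy()
count = len(uniques)

labels_before = (llab.tolist(), rlab.tolist())

# Same remapping as the quoted snippet: every -1 becomes the extra label `count`.
lmask = llab == -1
rmask = rlab == -1
if lmask.any() or rmask.any():
    np.putmask(llab, lmask, count)
    np.putmask(rlab, rmask, count)
    count += 1

labels_after = (llab.tolist(), rlab.tolist())

print(labels_before)
print(labels_after)
```

After the remapping, the NaN rows on both sides carry the same label, so any join performed on these integer labels pairs them just like ordinary equal keys.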

Disclaimer: I am not a pandas dev, and this is only my speculation, so the real issue could be something else. But at first glance, this seems like it.
