I recently asked a question regarding missing values in pandas and was directed to a GitHub issue. After reading through that page and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan
import numpy as np
import pandas as pd

# merge example
df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')

  col1  col2  col3
0  NaN     1     3

# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))

       col2  col3
col1
NaN       1   3.0
match     2   NaN
However, NaNs in groupby are excluded:
df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})
df.groupby('col1').sum()

       col2
col1
match     2
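For comparison, here is a minimal sketch of the same groupby with the NaN keys kept. It assumes pandas 1.1 or later, where groupby accepts a dropna argument; the variable names default_sums and with_nan_sums are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})

# Default behaviour: rows whose key is NaN are left out of the result.
default_sums = df.groupby('col1').sum()

# dropna=False (assumes pandas >= 1.1) keeps NaN as a group of its own,
# so the two NaN rows are summed together.
with_nan_sums = df.groupby('col1', dropna=False).sum()

print(default_sums)
print(with_nan_sums)
```

So groupby excludes NaN keys by default but can be asked to treat them as one group, which is closer to what merge and join do.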
Of course you can dropna() or df[df['col1'].notnull()], but I am curious as to why NaNs are excluded in some pandas operations like groupby and not others like merge, join, update, and map?
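A minimal sketch of that workaround applied to the merge example, assuming the same two frames as above. The names plain, left, right and filtered are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})

# Plain merge: the NaN key on the left matches the NaN key on the right.
plain = pd.merge(df, df2, on='col1')

# Workaround: drop rows with a missing key on both sides before merging,
# so NaN can no longer act as a matching key.
left = df.dropna(subset=['col1'])
right = df2.dropna(subset=['col1'])
filtered = pd.merge(left, right, on='col1')

print(len(plain), len(filtered))
```

With these frames the only matching key is NaN, so the filtered merge comes back empty.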
Essentially, as I asked above, why do merge and join match on np.nan when they do not compare equal?
Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes that the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.
A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine which rows are to be matched with each other.
Specifically, this portion
# NA group
lmask = llab == -1
lany = lmask.any()
rmask = rlab == -1
rany = rmask.any()

if lany or rany:
    if lany:
        np.putmask(llab, lmask, count)
    if rany:
        np.putmask(rlab, rmask, count)
    count += 1
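...seems to be the culprit. NaN keys are identified as a valid category (with categorical value equal to count). Since the missing keys on the left and on the right both receive that same label, they would then be matched like any other pair of equal keys.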
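To illustrate, here is a small sketch that runs the same logic on hand-written labels. The values of llab, rlab and count are hypothetical, chosen to mirror the merge example from the question (with -1 marking a missing key); they are not taken from an actual pandas run.

```python
import numpy as np

# Hypothetical factorized labels, with -1 standing for a missing key.
llab = np.array([-1, 0])  # left keys:  [NaN, 'match']
rlab = np.array([-1, 1])  # right keys: [NaN, 'no match']
count = 2                 # number of distinct non-null keys

# Same logic as the excerpt above: every -1 on either side is
# overwritten with the next free label, `count`.
lmask = llab == -1
lany = lmask.any()
rmask = rlab == -1
rany = rmask.any()

if lany or rany:
    if lany:
        np.putmask(llab, lmask, count)
    if rany:
        np.putmask(rlab, rmask, count)
    count += 1

print(llab, rlab, count)
```

After this step both NaN keys carry label 2, so a join on the labels would pair them up, while 'match' (0) and 'no match' (1) stay distinct.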
Disclaimer: I am not a pandas dev, and this is only my speculation, so the real issue could be something else. But at first glance, this seems to be it.