
pandas DataFrame combine_first method converts booleans to floats

I'm running into a strange issue where the combine_first method causes values stored as bool to be upcast to float64. Example:

In [1]: import pandas as pd

In [2]: df1 = pd.DataFrame({"a": [True]})

In [3]: df2 = pd.DataFrame({"b": ['test']})

In [4]: df2.combine_first(df1)
Out[4]:
     a     b
0  1.0  test

This problem was already reported in a post three years ago: "pandas DataFrame combine_first and update methods have strange behavior". That issue was said to be fixed, but I still see this behaviour under pandas 0.18.1.

Thank you for your help.

Somewhere along the chain of events that produces the combined DataFrame, potential missing values have to be handled. I'm aware that nothing is missing in your example, but None and np.nan are neither int nor bool. So to give a column a common dtype that can hold both a bool and a None or np.nan, the column has to be cast to either object or float. As float, a large number of operations become far more efficient, so it is a decent choice. It obviously isn't the best choice every time, but a choice has to be made nonetheless, and pandas tries to infer the best one.
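To make the dtype argument concrete, here is a minimal sketch of what happens when a bool column has to make room for a missing value. It does not call combine_first itself (its exact behaviour differs between pandas versions); it only shows the underlying promotion rules, and the variable names are mine.

```python
import numpy as np
import pandas as pd

# NumPy's common type for bool and float64 (the type of np.nan) is float64.
common = np.result_type(np.bool_, np.float64)
print(common)

# A bool Series that gains a missing value can no longer keep the bool dtype.
s = pd.Series([True, False])
widened = s.reindex([0, 1, 2])  # label 2 does not exist, so a NaN is introduced
print(widened.dtype)

# Stored as float instead, the flags become 1.0 / 0.0, which is the
# representation the question observed in column "a".
as_float = s.astype("float64")
print(as_float.tolist())
```

Either way the column stops being bool as soon as a NaN has to fit in, which is the choice described above.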

A workaround:

Setup

import pandas as pd

df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ['test']})

# Combine as in the question; column "a" may come back as float64 here.
df3 = df2.combine_first(df1)
df3


Solution

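# Look up the original dtype of every column (df1 supplies "a", df2 supplies "b").
dtypes = df1.dtypes.combine_first(df2.dtypes)

# Cast each column of the combined frame back to its original dtype.
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for k, v in dtypes.items():
    df3[k] = df3[k].astype(v)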

df3

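If you need this in more than one place, the workaround can be wrapped in a small function. This is only a sketch: combine_first_keep_dtypes is a hypothetical helper name, not a pandas API. It collects the original dtypes with a plain dict merge rather than dtypes.combine_first, and it skips columns that really do contain missing values, since casting those back to bool or int would either fail or silently turn NaN into True.

```python
import pandas as pd


def combine_first_keep_dtypes(left, right):
    """Hypothetical helper: combine_first, then restore original dtypes where safe."""
    combined = left.combine_first(right)

    # Original dtype of every column; `left` wins for columns present in both.
    wanted = {**right.dtypes.to_dict(), **left.dtypes.to_dict()}

    for col, dtype in wanted.items():
        # Only cast back when the column has no missing values after combining.
        if combined[col].notna().all():
            combined[col] = combined[col].astype(dtype)
    return combined


df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ["test"]})

result = combine_first_keep_dtypes(df2, df1)
print(result.dtypes)
```

For the frames in the question, column "a" ends up as bool again and column "b" keeps the dtype it had in df2.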

I ran into the same issue. This specific case does not seem to be fixed in pandas yet, so I've filed a bug report:

https://github.com/pandas-dev/pandas/issues/20699
