I'm running into a strange issue where the combine_first method upcasts values stored as bool to float64. Example:
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame({"a": [True]})
In [3]: df2 = pd.DataFrame({"b": ['test']})
In [4]: df2.combine_first(df1)
Out[4]:
     a     b
0  1.0  test
This problem was already reported in a post from 3 years ago: pandas DataFrame combine_first and update methods have strange behavior. That issue was said to be solved, but I still see this behaviour under pandas 0.18.1.
Thank you for your help.
Somewhere along the chain of events that produces the combined dataframe, potential missing values have to be addressed. I'm aware that nothing is missing in your example. None and np.nan are not int or bool, so in order to have a common dtype that can hold both a bool and a None or np.nan, the column has to be cast as either object or float. As float, a large number of operations become far more efficient, so it is a decent choice. It obviously isn't the best choice all of the time, but a choice has to be made nonetheless, and pandas tries to infer the best one.
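To make the dtype argument concrete, here is a minimal sketch that introduces a missing value with reindex and shows the column's dtype changing. It does not call combine_first itself, and the variable names are only for illustration; the exact dtype combine_first picks may differ between pandas versions.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2])
bools = pd.Series([True, False])

# np.nan is an ordinary Python float, which is why it cannot live in an
# int or bool column without changing that column's dtype.
print(isinstance(np.nan, float))

# Reindexing to a label that does not exist (2) introduces a missing value.
ints_with_gap = ints.reindex([0, 1, 2])
bools_with_gap = bools.reindex([0, 1, 2])

print(ints.dtype, "->", ints_with_gap.dtype)    # int64 column becomes float64
print(bools.dtype, "->", bools_with_gap.dtype)  # bool column becomes object
```

In both cases the original dtype is given up as soon as a missing value has to be stored, which is the same trade-off described above.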
A workaround:
Setup
df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ['test']})
df3 = df2.combine_first(df1)
df3
Solution
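# Recover each column's original dtype (df1's where available, else df2's)
dtypes = df1.dtypes.combine_first(df2.dtypes)
# Cast each combined column back; iteritems() was removed in pandas 2.0, so use items()
for k, v in dtypes.items():
    df3[k] = df3[k].astype(v)
df3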
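One caveat with casting every column back: if the combined frame really does contain missing values, astype(bool) turns NaN into True and astype(int) raises. Below is a rough sketch of a guarded version. The function name combine_first_keep_dtypes is hypothetical (it is not part of pandas), and skipping columns that contain missing values is my own assumption about what is safe.

```python
import pandas as pd


def combine_first_keep_dtypes(left, right):
    """Hypothetical helper: combine_first, then restore original dtypes where safe."""
    combined = left.combine_first(right)
    # Prefer the dtype from `left`; fall back to `right` for columns only it has.
    wanted = {**right.dtypes.to_dict(), **left.dtypes.to_dict()}
    for col, dtype in wanted.items():
        # Only cast back when the column has no missing values; otherwise a
        # cast to bool or int would raise or silently turn NaN into True.
        if not combined[col].isna().any():
            combined[col] = combined[col].astype(dtype)
    return combined


df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ["test"]})
result = combine_first_keep_dtypes(df2, df1)
print(result.dtypes)
```

Columns that still hold missing values keep whatever dtype combine_first gave them, so nothing is silently converted.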
I ran into the same issue. This specific case does not seem to be fixed in Pandas yet. I've filed a bug report: