
Why does the pandas DataFrame constructor infer float64 when the columns can be downcast to float32?

I am starting with a large matrix, which I convert to a pandas DataFrame, letting pandas infer the data types of the columns.

The columns are inferred as float64, but I am subsequently able to downcast these columns to float32 using the pandas to_numeric function without a loss of precision.

Why is pandas inefficiently inferring the columns as float64 if they are able to be downcast to float32 without a loss of precision?

import numpy as np
import pandas as pd

a = np.matrix('0.1 0.2; 0.3 0.4')
a_df = pd.DataFrame(list(map(np.ravel, a)), dtype=None)
print(a_df.dtypes)
# the columns are float64
genotype_data_df = a_df.apply(pd.to_numeric, downcast='float')
print(genotype_data_df.dtypes)
# the columns are now float32

I assume there is an underlying technical or practical reason why the library is implemented this way. If so, I am looking for an answer that explains it.

Why is pandas inefficiently inferring the columns as float64

It's not clear to me that defaulting to float64 is inefficient. It is simply the default dtype for floating-point values, and it avoids having to examine every value in the column to pick a dtype and then possibly re-cast the column to a higher precision.

Why did they implement it that way instead of, say, defaulting to integer or float32? Because if any value in the column exceeds that default precision, the entire column needs to be re-cast to a greater precision, and finding that out requires examining every single value in the column. It is less redundant and less expensive to assume the higher precision from the start than to examine every value and then re-cast.
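As a rough illustration of that default, here is a minimal sketch. It uses a plain np.array in place of the np.matrix from the question (my substitution), and the names values and frame are only for this example. It shows that the data is already float64 at the NumPy level before pandas sees it, and that float32 has a much smaller range than float64.

```python
import numpy as np
import pandas as pd

# Python floats are 64-bit doubles, and NumPy stores them as float64 by default.
values = np.array([[0.1, 0.2], [0.3, 0.4]])
print(values.dtype)

# The DataFrame constructor keeps the dtype of the array it is given.
frame = pd.DataFrame(values)
print(frame.dtypes)

# float32 tops out far below float64, so a narrower default could not hold
# every value that a float64 column can.
print(np.finfo(np.float32).max)
print(np.finfo(np.float64).max)
```

So in this case pandas is not so much choosing float64 as preserving what it was handed.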

Of course this may not seem "optimal", but this is a tradeoff you have to make if you're not able to specify a dtype for the constructor.
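If you do know the precision you want up front, you can pass it to the constructor. A minimal sketch, assuming the data fits in float32 (a_df32 is just a name I picked, and np.array stands in for the np.matrix from the question):

```python
import numpy as np
import pandas as pd

a = np.array([[0.1, 0.2], [0.3, 0.4]])

# Ask for float32 explicitly instead of relying on the default.
a_df32 = pd.DataFrame(a, dtype=np.float32)
print(a_df32.dtypes)

# Each float32 value takes 4 bytes instead of 8.
print(a.nbytes, a_df32.to_numpy().nbytes)
```

This skips the separate downcast step, at the cost of deciding the dtype yourself.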

they are able to be downcast to int32 without a loss of precision?

You're mistaken about this. There is apparently no loss of precision, but if you check genotype_data_df.dtypes after calling pd.to_numeric with downcast='integer', you'll see that the columns have not been cast to a lower-precision (integer) dtype; in fact they remain float64.

>>> import numpy as np
>>> import pandas as pd
>>> a = np.matrix('0.1 0.2; 0.3 0.4')
>>> a_df = pd.DataFrame(list(map(np.ravel, a)), dtype=None)
>>> genotype_data_df = a_df.apply(pd.to_numeric, downcast='integer')
>>> genotype_data_df.dtypes

0    float64
1    float64
dtype: object
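For comparison, here is a sketch of how one might check whether downcast='float' really preserves the values. It is an illustration, not something from the original posts: the variable names are mine, and np.array is used in place of np.matrix. The idea is to downcast, widen back to float64, and compare with the original frame.

```python
import numpy as np
import pandas as pd

a_df = pd.DataFrame(np.array([[0.1, 0.2], [0.3, 0.4]]))

# Downcast each column to the smallest float dtype pandas will use.
down = a_df.apply(pd.to_numeric, downcast="float")
print(down.dtypes)

# Widen back to float64 and compare against the original values.
round_trip = down.astype(np.float64)
exact_match = bool((round_trip == a_df).all().all())
max_error = float((round_trip - a_df).abs().max().max())
print(exact_match)
print(max_error)
```

Values such as 0.1 have no exact binary representation, and the nearest float32 differs from the nearest float64, so a downcast that looks lossless when printed can still change the stored values slightly.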

