
pandas to_numeric(…, downcast='float') losing precision

Downcasting a pandas DataFrame (column by column) from float64 to float32 results in a loss of precision, even though both the largest (9.761140e+02) and smallest (0.000000e+00) elements fit comfortably in float32.
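For context, the rounding error of a single float64 → float32 cast is bounded by the float32 machine epsilon, so each individual value in this range survives the cast almost unchanged; a small sketch (not from the original post) illustrating that bound:

```python
import numpy as np

# Casting one float64 to float32 rounds to the nearest representable value.
# Relative error is bounded by float32 eps (~1.19e-07), so a value near the
# question's maximum of 976.114 is off by well under 1e-3 in absolute terms.
x = 9.761140e+02
err = abs(float(np.float32(x)) - x)
print(err)                        # well below 1e-3
print(np.finfo(np.float32).eps)   # ~1.19e-07
```

So the cast itself can only explain a tiny per-element error, which makes the large shift in the column mean surprising.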

The dataset is fairly large: 55 million rows by 12 columns. The mean of one particular column is 1.343987e+00 without downcasting, and 1.224472e+00 after.

I am getting the same results with astype() .
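That is expected: on an all-float column, `pd.to_numeric(..., downcast='float')` and `Series.astype(np.float32)` both round each float64 to the nearest float32, so they should produce identical values. A minimal check (toy data, not the question's dataset):

```python
import numpy as np
import pandas as pd

# Both paths round each float64 element to the nearest float32,
# so the resulting values are bit-for-bit identical.
s = pd.Series([0.0, 1.343987, 9.761140e+02])
a = pd.to_numeric(s, downcast='float')
b = s.astype(np.float32)
print(a.dtype, b.dtype)        # float32 float32
print((a.values == b.values).all())  # True
```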

This was a pretty interesting question. I tested several DataFrames ranging from 1 million records up to 55 million, the same size as yours, keeping the min and max values similar to the ones you have.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x, y = [], []
for idx, num in enumerate(range(1, 57, 2)):
    print(f"{idx+1}) Testing with {num} million records...")
    rows = num*(10**6)
    cols = ['col']

    # one float64 column drawn from the same [0, 976.114] range as the question
    df = pd.DataFrame(np.random.uniform(0, 9.761140e+02, size=(rows, len(cols))), columns=cols)
    df['col1'] = pd.to_numeric(df['col'], downcast='float')  # float64 -> float32
    df['diff'] = df['col'] - df['col1']  # per-element cast error (for inspection)

    # drift of the column mean introduced by the downcast
    diff = df['col'].mean() - df['col1'].mean()

    x.append(num)
    y.append(diff)

plt.plot(x, y, 'ro')
plt.xlabel('number of rows (millions)')
plt.ylabel('precision value lost')
plt.show()

Here's the plot:

[plot: precision value lost vs. number of rows (millions)]

Based on the plot, it seems that after 35 million records there is a sudden increase in the loss of precision, and the growth appears to be logarithmic in nature. I haven't figured out yet why it behaves this way.
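One hypothetical way to narrow this down (an assumption on my part, not something verified against the 55-million-row dataset) is to separate the per-element cast error from the error of the mean reduction itself: upcast the float32 column back to float64 before averaging. If the upcast mean matches the float64 mean closely, most of the drift comes from carrying out the sum on float32 data rather than from the cast:

```python
import numpy as np
import pandas as pd

# Smaller reproduction: 1 million values in the question's [0, 976.114] range.
rng = np.random.default_rng(0)
col = pd.Series(rng.uniform(0, 9.761140e+02, size=1_000_000))
col32 = col.astype(np.float32)

m64 = col.mean()                            # float64 data, float64 reduction
m32 = col32.mean()                          # reduction on float32 data
m32_up = col32.astype(np.float64).mean()    # same cast values, float64 reduction

print(abs(m64 - m32_up))   # cast error alone: tiny
print(abs(m64 - float(m32)))  # may be larger: includes reduction error
```

On this sketch, the cast-only difference stays negligible, which would point at the accumulation step as the place to investigate.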

