pandas to_numeric(…, downcast='float') losing precision
Downcasting a pandas DataFrame (by columns) from float64 to float32 loses precision, even though the largest (9.761140e+02) and smallest (0.000000e+00) elements fit comfortably in float32. The dataset is fairly large: 55 million rows by 12 columns. The mean of one particular column is 1.343987e+00 without downcasting and 1.224472e+00 after.
I get the same results with np.astype().
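A likely culprit is not the cast itself but the accumulation inside the mean: each value round-trips to float32 with about 7 significant digits, but summing tens of millions of them in a float32 accumulator drops the low-order bits of every addend once the running total gets large. A minimal sketch (the value range is taken from the question; a sequential `cumsum` in float32 stands in for a naive float32 running sum, since `np.sum` itself uses pairwise summation):

```python
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.uniform(0.0, 9.761140e+02, size=1_000_000)
a32 = a64.astype(np.float32)

# Each individual value survives the cast: the per-element relative
# error is bounded by float32 machine epsilon (~1.2e-7).
per_elem = np.max(np.abs(a64 - a32.astype(np.float64)) / np.maximum(a64, 1e-12))

# A sequential float32 running sum (cumsum is evaluated left to right,
# unlike np.sum, which uses pairwise summation) drifts far more.
naive32 = np.cumsum(a32, dtype=np.float32)[-1]
exact = a64.sum()
seq_rel = abs(float(naive32) - exact) / exact

print(per_elem)   # on the order of 1e-7: the cast itself is fine
print(seq_rel)    # much larger: the sequential float32 summation is not
```

So whether a float32 column's mean drifts depends on which summation path the library takes, not on whether the values fit in float32.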
This was a pretty interesting question. I tested several DataFrames, ranging from 1 million records up to 55 million (the same size as yours), keeping the min and max values similar to the ones you have.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x, y = [], []
for idx, num in enumerate(range(1, 57, 2)):
    print(f"{idx+1}) Testing with {num} million records...")
    rows = num * (10**6)
    cols = ['col']
    # Uniform values in the same range as the question's data
    df = pd.DataFrame(np.random.uniform(0, 9.761140e+02, size=(rows, len(cols))), columns=cols)
    # Downcast float64 -> float32
    df['col1'] = pd.to_numeric(df['col'], downcast='float')
    # Per-element differences (kept for inspection, not used below)
    df['diff'] = df['col'] - df['col1']
    # The gap between the two means is the precision lost
    diff = df['col'].mean() - df['col1'].mean()
    x.append(num)
    y.append(diff)

plt.plot(x, y, 'ro')
plt.xlabel('number of rows (millions)')
plt.ylabel('precision value lost')
plt.show()
Based on the plot, it seems that after 35 million records there is a sudden increase in the loss of precision, and the growth appears logarithmic in nature. I haven't yet figured out why it behaves this way.
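One plausible mechanism (an assumption on my part, not something verified against pandas internals): above 2**24 a float32 can no longer represent every integer, so once a naive float32 running sum grows large enough, small addends are partially or entirely absorbed. The practical workaround is to keep the compact float32 storage but give the reduction a float64 accumulator, which `np.mean`'s `dtype` argument provides:

```python
import numpy as np
import pandas as pd

# Above 2**24 the spacing between adjacent float32 values exceeds 1,
# so a small addend can vanish into a large accumulator entirely.
big = np.float32(2**24)
print(big + np.float32(1.0) == big)   # True: the 1.0 is absorbed

rng = np.random.default_rng(42)
s64 = pd.Series(rng.uniform(0.0, 9.761140e+02, size=2_000_000))
s32 = pd.to_numeric(s64, downcast='float')     # float32 storage, half the memory

# Keep the compact dtype, but accumulate the mean in float64:
m_acc64 = np.mean(s32.to_numpy(), dtype=np.float64)
print(s64.mean(), m_acc64)   # agree to well within float32's per-element error
```

This way the downcast only costs the ~7 significant digits of each stored value, not the much larger accumulation error.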