Converting large Pandas DataFrame from "sparse" float to int
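This is different from the question about the error raised by astype when a dataframe contains NaN, because I need to keep the NaN values, so I have chosen to use the experimental IntegerArray. The crux of this question is avoiding the loop.
We have a number of large medical datasets that I import into Pandas from SAS. Most of the fields are enumerated types and should be represented as integers, but they come in as float64 because many of them contain NaN values. The experimental IntegerArray type in Pandas solves the NaN problem. However, these datasets are very large, and I would like to convert them in a script based on the data itself. The following script works, but it is extremely slow, and I suspect there is a more Pythonic or "Pandorable" way of writing it.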
import pandas as pd

# Convert any non-float fields to IntegerArray (Int)
# Note that IntegerArrays are an experimental addition in Pandas 0.24. They
# allow integer columns to contain NaN fields like float columns.
#
# This is a rather brute-force technique that loops through every column
# and every row. There's got to be a more efficient way to do it since it
# takes a long time and uses up a lot of memory.
def convert_integer(df):
    for col in df.columns:
        intcol_flag = True
        if df[col].dtype == 'float64':   # Assuming dtype is "float64"
            # TODO: Need to remove inner loop - SLOW!
            for val in df[col]:
                # If not NaN and the int() value is different from
                # the float value, then we have an actual float.
                if pd.notnull(val) and abs(val - int(val)) > 1e-6:
                    intcol_flag = False
                    break
            # If not a float, change it to an Int based on size
            if intcol_flag:
                if df[col].abs().max() < 127:
                    df[col] = df[col].astype('Int8')
                elif df[col].abs().max() < 32767:
                    df[col] = df[col].astype('Int16')
                else:   # assuming no ints greater than 2147483647
                    df[col] = df[col].astype('Int32')
        print(f"{col} is {df[col].dtype}")
    return df
I thought the inner for loop was the problem, but I tried replacing it with:
s = df[col].apply(lambda x: pd.notnull(x) and abs(x - int(x)) > 1e-6)
if s.any():
    intcol_flag = False
and it is still slow.
Here is some sample data and the desired output:
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.nan], (3,9)),
                  columns=[f'col{i}' for i in range(9)])
df
       col0    col1      col2      col3    col4    col5      col6  col7  col8
0       2.0     NaN  111111.0       1.0     2.0  5000.0  111111.0   2.0   NaN
1       1.0     NaN       2.0       3.3     1.0     2.0       1.0   3.3   1.0
2  111111.0  5000.0       1.0  111111.0  5000.0     1.0    5000.0   3.3   2.0
The result should be:
col0 is Int32
col1 is Int16
col2 is Int32
col3 is float64
col4 is Int16
col5 is Int16
col6 is Int32
col7 is float64
col8 is Int8
Find which columns need to be cast to each type, then do the conversion for each type all at once.
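import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.nan], (3,9)),
                  columns=[f'col{i}' for i in range(9)])
# Label each column with the smallest Int type that holds its maximum value
s = pd.cut(df.max(), bins=[0, 127, 32767, 2147483647], labels=['Int8', 'Int16', 'Int32'])
# Keep a label only for columns that were previously float and whose
# values are all NaN or "I"nteger-like (no fractional part)
s = s.where((df.dtypes=='float') & (df.isnull() | (df%1 == 0)).all())
# Cast all columns that share a target type in one step. Assigning with
# df[...] replaces the columns; in recent pandas versions df.loc[:, ...]
# writes into the existing float columns and keeps their dtype.
for idx, gp in s.groupby(s, observed=True):
    df[gp.index] = df[gp.index].astype(idx)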
df.dtypes
#col0 Int32
#col1 Int16
#col2 Int32
#col3 float64
#col4 Int16
#col5 Int16
#col6 Int32
#col7 float64
#col8 Int8
#dtype: object
print(df)
#     col0  col1    col2      col3  col4  col5    col6  col7  col8
#0       2   NaN  111111       1.0     2  5000  111111   2.0   NaN
#1       1   NaN       2       3.3     1     2       1   3.3     1
#2  111111  5000       1  111111.0  5000     1    5000   3.3     2
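To reuse this across several datasets, the same idea can be wrapped in a function. What follows is only a minimal sketch and not part of the original answer. The name `downcast_integer_like_floats` is hypothetical. As assumptions of mine, it checks the column minimum as well as the maximum so that negative codes are handled, it returns a copy rather than modifying the input, and it leaves any column that does not fit in Int32 as float64, in line with the question's assumption that no values exceed 2147483647.

```python
import numpy as np
import pandas as pd


def downcast_integer_like_floats(df):
    """Return a copy of df with integer-like float64 columns cast to nullable ints.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = df.copy()
    floats = out.select_dtypes(include="float64")
    # A column qualifies when every value is missing or has no fractional part.
    integer_like = (floats.isna() | (floats % 1 == 0)).all()
    for col in floats.columns:
        if not integer_like[col]:
            continue
        lo, hi = floats[col].min(), floats[col].max()
        for dtype in ("Int8", "Int16", "Int32"):
            info = np.iinfo(np.dtype(dtype.lower()))
            # An all-NaN column has NaN bounds; give it the smallest type.
            if pd.isna(hi) or (info.min <= lo and hi <= info.max):
                out[col] = floats[col].astype(dtype)
                break
    return out


sample = pd.DataFrame({
    "a": [1.0, np.nan, 2.0],
    "b": [5000.0, 1.0, np.nan],
    "c": [111111.0, 2.0, 1.0],
    "d": [3.3, 1.0, 2.0],
})
print(downcast_integer_like_floats(sample).dtypes)
```

The integer-likeness test is still one vectorized pass over the float columns; the remaining loop runs once per column rather than once per value, so it should stay cheap even for tall frames.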