繁体   English   中英

有效地将大型 Pandas DataFrame 列从浮点数转换为整数

[英]Efficiently convert large Pandas DataFrame columns from float to int

这与在 dataframe 中存在 NaN 时使用 astype 的错误不同,因为我需要保留 NaN 值,因此我选择使用实验性IntegerArray 这个问题的症结在于试图避免循环。

我们有许多大型医疗数据集,我从 SAS 导入到 Pandas。 大多数字段是枚举类型,应该表示为整数,但它们以 float64 的形式出现,因为许多字段包含 NaN 值。 Pandas 中的实验 IntegerArray 类型解决了 NaN 问题。 但是,这些数据集非常大,我想根据数据本身将它们转换为脚本。 以下脚本有效,但速度极慢,我想出了一种更 Pythonic 或“Pandorable”的编写方式。

# Convert any non-float fields to IntegerArray (Int)
# Note than IntegerArrays are an experimental addition in Pandas 0.24. They
# allow integer columns to contain NaN fields like float columns.
#
# This is a rather brute-force technique that loops through every column
# and every row. There's got to be a more efficient way to do it since it 
# takes a long time and uses up a lot of memory.
def convert_integer (df):
    for col in df.columns:
        intcol_flag = True
        if df[col].dtype == 'float64':   # Assuming dtype is "float64"
            # TODO: Need to remove inner loop - SLOW!
            for val in df[col]:
                # If not NaN and the int() value is different from
                # the float value, then we have an actual float.
                if pd.notnull(val) and abs(val - int(val)) > 1e-6:
                    intcol_flag = False
                    break;
            # If not a float, change it to an Int based on size
            if intcol_flag:
                if df[col].abs().max() < 127:
                    df[col] = df[col].astype('Int8')
                elif df[col].abs().max() < 32767:
                    df[col] = df[col].astype('Int16')
                else:   # assuming no ints greater than 2147483647 
                    df[col] = df[col].astype('Int32') 
        print(f"{col} is {df[col].dtype}")
    return df

我认为内部 for 循环是问题所在,但我尝试将其替换为:

            s = df[col].apply(lambda x: pd.notnull(x) and abs(x - int(x)) > 1e-6)
            if s.any():
                intcol_flag = False

它仍然很慢。

这是一些示例数据和所需的 output:

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.NaN], (3,9)), 
                  columns=[f'col{i}' for i in range(9)])
df

    col0    col1    col2    col3    col4    col5    col6    col7    col8
0   2.0     NaN   111111.0  1.0     2.0   5000.0  111111.0  2.0     NaN
1   1.0     NaN     2.0     3.3     1.0      2.0     1.0    3.3     1.0
2  111111.0 5000.0  1.0   111111.0  5000.0   1.0    5000.0  3.3     2.0

结果应该是:

col0 is Int32
col1 is Int16
col2 is Int32
col3 is float64
col4 is Int16
col5 is Int16
col6 is Int32
col7 is float64
col8 is Int8

找到需要对每种类型进行类型转换的列,然后为每种类型一次完成所有操作。

样本数据

import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.NaN], (3,9)), 
                  columns=[f'col{i}' for i in range(9)])

代码

s = pd.cut(df.max(), bins=[0, 127, 32767, 2147483647], labels=['Int8', 'Int16', 'Int32'])
s = s.where((df.dtypes=='float') & (df.isnull() | (df%1 == 0)).all())
            # Cast previously       # If all values are 
            # float columns         # "I"nteger-like

for idx, gp in s.groupby(s):
    df.loc[:, gp.index] = df.loc[:, gp.index].astype(idx)

df.dtypes
#col0      Int32
#col1      Int16
#col2      Int32
#col3    float64
#col4      Int16
#col5      Int16
#col6      Int32
#col7    float64
#col8       Int8
#dtype: object

print(df)
#     col0  col1    col2      col3  col4  col5    col6  col7  col8
#0       2   NaN  111111       1.0     2  5000  111111   2.0   NaN
#1       1   NaN       2       3.3     1     2       1   3.3     1
#2  111111  5000       1  111111.0  5000     1    5000   3.3     2

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM