
Applying a (row) function to a DataFrame changes column types

I'm having an issue with unintended changes to column types, distilled below. Column x holds floats and column icol holds ints. When testfunction (which does nothing) is applied, column icol is changed to type float64, as this code demonstrates:

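import pandas as pd

df = pd.DataFrame({'x': [1000, -1000, 1.0]})  # float column
df['icol'] = 1                                # int column
print(df.dtypes)  # x is float64, icol is int64

def testfunction(r):
    # Does nothing: returns the row unchanged.
    return r

df = df.apply(testfunction, axis='columns')
print(df.dtypes)  # icol is now float64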

However, if I make both the x and icol columns ints, then the types do not get changed.

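import pandas as pd

df = pd.DataFrame({'x': [1000, -1000]})  # int column this time
df['icol'] = 1
print(df.dtypes)  # both columns are int64

def testfunction(r):
    # Does nothing: returns the row unchanged.
    return r

df = df.apply(testfunction, axis='columns')
print(df.dtypes)  # both columns are still int64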

This is a potential hazard, for example if an int column is later used as a key.

Is this a feature, or am I doing something wrong here? I'm running Python 3.7.3 on Ubuntu.

Thanks

pandas tries to keep its operations as numerically efficient as possible. When you apply a function along rows, pandas first constructs a Series from each row. A Series has a single dtype, so if the row is a mix of ints and floats, the ints are converted to floats, just as when you pass a mixed list to the Series constructor: Series([1000.0, 1]) becomes all floats, i.e. Series([1000.0, 1.0]).
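The upcast can be checked directly. This is a minimal sketch that reuses the question's column names; the variable names s and row are only for illustration.

```python
import pandas as pd

# A Series holds one dtype, so a mixed int/float list is upcast to float64.
s = pd.Series([1000.0, 1])
print(s.dtype)

# The same upcast happens when a row is pulled out of a mixed-dtype frame.
df = pd.DataFrame({'x': [1000, -1000, 1.0]})
df['icol'] = 1
row = df.iloc[0]
print(df.dtypes.to_dict())  # per-column dtypes of the frame
print(row.dtype)            # single dtype of the extracted row
```

The frame itself keeps icol as an integer column; only the row view handed to the function is float.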

By the same logic, if your row contains a string, the object dtype is used, and every value keeps its type at the cost of performance. In general, you should avoid apply if at all possible and use other pandas methods to get the result.

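import pandas as pd

def testfunction(r):
    return r

df = pd.DataFrame({'x': [1000, -1000, 1.0]})
df['y'] = 1
df['z'] = 'hello'  # a string column forces each row to object dtype

print(df.apply(testfunction, axis='columns').dtypes)
# prints:
# x    float64
# y      int64
# z     object
# dtype: object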
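As a rough illustration of "other pandas methods", here is a sketch of the same kind of work done column-wise instead of row-wise. The computation itself (scaling x by icol and incrementing icol) is a made-up stand-in for whatever the row function would do.

```python
import pandas as pd

df = pd.DataFrame({'x': [1000, -1000, 1.0]})
df['icol'] = 1

# Hypothetical row logic, written as column operations:
# scale x by icol and bump icol by one. No row Series is built,
# so each column keeps its own dtype.
out = df.assign(x=df['x'] * df['icol'], icol=df['icol'] + 1)
print(out.dtypes)
```

Because nothing is ever packed into a single mixed row, icol is never upcast.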

Thanks for the informative answer and comments. Here is another simple workaround, for anyone else who doesn't want to give up the row-function pattern:

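import pandas as pd

df = pd.DataFrame({'x': [1000, -1000.1]})
df['icol'] = 1
print(df.dtypes)  # x is float64, icol is int64

def testfunction(r):
    # Does nothing: returns the row unchanged.
    return r

# save the types
types = df.dtypes

df = df.apply(testfunction, axis='columns')
print(df.dtypes)  # icol has been upcast to float64

# put 'em back
df = df.astype(types.to_dict(), copy=False)

print(df.dtypes)  # icol is int64 again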
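If this save-and-restore step is needed in several places, it could be wrapped in a small helper. This is only a sketch: apply_rows_keep_dtypes is a hypothetical name, and it assumes the row function returns a row with the same column labels.

```python
import pandas as pd

def apply_rows_keep_dtypes(frame, func):
    """Apply func to each row, then cast columns back to their original dtypes."""
    original = frame.dtypes.to_dict()
    result = frame.apply(func, axis='columns')
    # Only restore columns that still exist in the result.
    restore = {col: dt for col, dt in original.items() if col in result.columns}
    return result.astype(restore)

df = pd.DataFrame({'x': [1000, -1000.1]})
df['icol'] = 1
restored = apply_rows_keep_dtypes(df, lambda r: r)
print(restored.dtypes)
```

One caveat with any cast back to an integer dtype: if the row function writes NaN into that column, astype raises an error, and fractional values are truncated rather than rejected.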
