Applying a (row) function to a DataFrame changes column types

I am seeing unintended changes to column types, distilled below. Column x holds floats and column icol holds ints. When testfunction (which does nothing) is applied, column icol is changed to type float64, as this code demonstrates:

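import pandas as pd

df = pd.DataFrame({'x': [1000, -1000, 1.0]})  # the 1.0 makes x float64
df['icol'] = 1                                # int64
print(df.dtypes)

def testfunction(r):
    # No-op: return the row unchanged
    return r

df = df.apply(testfunction, axis='columns')
print(df.dtypes)  # icol is now float64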

However, if I make both the x and icol columns ints, then the types do not get changed.

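import pandas as pd

df = pd.DataFrame({'x': [1000, -1000]})  # all ints, so x is int64
df['icol'] = 1                           # int64
print(df.dtypes)

def testfunction(r):
    # No-op: return the row unchanged
    return r

df = df.apply(testfunction, axis='columns')
print(df.dtypes)  # both columns are still int64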

This is a potential hazard, for example if the int column is later used as a key.

Is this a feature, or am I doing something wrong here? I am running Python 3.7.3 on Ubuntu.

Thanks

Pandas tries to be as numerically efficient as possible. When applying a function to a row, Pandas first constructs a Series from that row. If the row is a mix of ints and floats, the ints are converted to floats, just as when you pass a mixed list to the Series constructor: Series([1000.0, 1]) becomes all floats, i.e. Series([1000.0, 1.0]).
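To see the upcasting directly, here is a minimal sketch that inspects the dtype of a mixed Series and of the rows built from the question's DataFrame. The names s, first_row and row_dtypes are only illustrative.

```python
import pandas as pd

# A list mixing a float and an int collapses to one float64 dtype.
s = pd.Series([1000.0, 1])
print(s.dtype)

# The question's DataFrame: x is float64, icol is int64.
df = pd.DataFrame({'x': [1000, -1000, 1.0]})
df['icol'] = 1

# A single row pulled out as a Series has to pick one dtype for both values.
first_row = df.iloc[0]
print(first_row.dtype)

# Report the dtype of every row Series that a row-wise apply hands to the function.
row_dtypes = df.apply(lambda r: str(r.dtype), axis='columns')
print(row_dtypes)
```

Every row arrives as float64, which is why icol comes back as float64 even when the function returns the row untouched.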

Consequently, if your row contains a string, the object dtype is used and all of the types are preserved, at the cost of performance. In general, you should avoid apply where possible and use other Pandas methods to get the results.

import pandas as pd

def testfunction(r):
    return r  # same no-op as in the question

df = pd.DataFrame({'x': [1000, -1000, 1.0]})
df['y'] = 1
df['z'] = 'hello'

print(df.apply(testfunction, axis='columns').dtypes)
# prints:
# x    float64
# y      int64
# z     object
# dtype: object
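As an illustration of the "use other Pandas methods" advice, here is a sketch of a column-wise computation. The row logic (adding icol to x) and the column name total are made up for the example, since the question's function does nothing.

```python
import pandas as pd

df = pd.DataFrame({'x': [1000, -1000, 1.0]})
df['icol'] = 1

# Hypothetical row logic, written column-wise instead of via apply(axis='columns').
# No per-row Series is built, so the existing columns keep their dtypes.
df['total'] = df['x'] + df['icol']

print(df.dtypes)
```

Here icol stays int64 because it is never packed into a row together with the float column.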

Thanks for the informative answer and comments. Here is another simple workaround, for anyone else who doesn't want to give up the row function pattern:

import pandas as pd

df = pd.DataFrame({'x': [1000, -1000.1]})
df['icol'] = 1
print(df.dtypes)

def testfunction(r):
    # No-op: return the row unchanged
    return r

# save the types
types = df.dtypes

df = df.apply(testfunction, axis='columns')
print(df.dtypes)  # icol has become float64

# put 'em back
df = df.astype(types.to_dict())

print(df.dtypes)  # icol is int64 again
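The same save-and-restore idea can be wrapped in a small function. This is only a sketch: apply_rows_keep_dtypes is a hypothetical helper name, not a pandas API, and it assumes the row function returns a row with the same columns.

```python
import pandas as pd

def apply_rows_keep_dtypes(df, func):
    """Hypothetical helper: row-wise apply, then cast columns back to their original dtypes."""
    original = df.dtypes.to_dict()            # remember the dtypes before apply
    result = df.apply(func, axis='columns')   # rows may be upcast to float64 here
    return result.astype(original)            # cast each column back

df = pd.DataFrame({'x': [1000, -1000.1]})
df['icol'] = 1

restored = apply_rows_keep_dtypes(df, lambda r: r)
print(restored.dtypes)
```

One caveat with casting back: if the row function puts NaN into an int column, astype raises an error, and a fractional value would be truncated, so this suits functions that leave the int columns integral.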

