
Pandas Dataframe object types fillna exception over different datatypes

I have a Pandas DataFrame with different dtypes for the different columns. E.g. df.dtypes returns the following.

Date                    datetime64[ns]
FundID                           int64
FundName                        object
CumPos                           int64
MTMPrice                       float64
PricingMechanism                object

Several of these columns have missing values. Group operations on them with NaN values in place cause problems, so getting rid of the NaNs with the .fillna() method is the obvious choice. The problem is that the obvious choice for strings is .fillna(""), while .fillna(0) is the correct choice for ints and floats. Using either method on the whole DataFrame throws an exception. Are there any elegant solutions besides filling the columns individually (I have about 30 columns)? I have a lot of code depending on the DataFrame and would prefer not to retype the columns, as that is likely to break some other logic. I can do:

df.FundID.fillna(0)
df.FundName.fillna("")
etc

You can iterate through them and use an if statement!

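import pandas as pd

for col in df:
    # numeric columns (ints and floats) get 0
    if pd.api.types.is_numeric_dtype(df[col]):
        # fillna returns a new Series, so assign it back
        df[col] = df[col].fillna(0)
    # string (object) columns get ""; other dtypes such as datetimes are left alone
    elif pd.api.types.is_object_dtype(df[col]):
        df[col] = df[col].fillna("")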

When you iterate through a pandas DataFrame, you get the name of each column, so to access a column you use df[col]. This way you don't need to do it manually: the script goes through each column and checks its dtype!

You can grab the float64 and object columns using:

In [11]: float_cols = df.select_dtypes(include='float64').columns

In [12]: object_cols = df.select_dtypes(include='object').columns

Int columns won't contain NaNs; otherwise they would have been upcast to float.
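The upcast mentioned above can be seen in a minimal sketch (the series names are made up for illustration). It also shows pandas' nullable Int64 extension dtype, which is not part of the original answer, as one way an integer column can hold missing values:

```python
import numpy as np
import pandas as pd

# A plain integer series keeps an integer dtype.
s_int = pd.Series([1, 2, 3])

# Adding a NaN forces the default NumPy-backed dtype up to float.
s_missing = pd.Series([1, np.nan, 3])

# The nullable extension dtype "Int64" stores the gap as pd.NA instead.
s_nullable = pd.Series([1, None, 3], dtype="Int64")

print(s_int.dtype.kind)      # 'i' (integer)
print(s_missing.dtype.kind)  # 'f' (float)
print(s_nullable.dtype)      # Int64
```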

Now you can apply the respective fillnas, in one cheeky way:

In [13]: d1 = dict((col, '') for col in object_cols)

In [14]: d2 = dict((col, 0) for col in float_cols)

In [15]: df.fillna(value=dict(d1, **d2))
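As a self-contained sketch of the dictionary approach above: the small frame below is invented for illustration (column names borrowed from the question), and the string column is given an explicit object dtype as an assumption so that the dtype selection behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "FundID": [1, 2, 3],
    "FundName": pd.Series(["A", None, "C"], dtype=object),
    "MTMPrice": [1.5, np.nan, 3.0],
})

object_cols = df.select_dtypes(include="object").columns
float_cols = df.select_dtypes(include="float64").columns

# Map each column name to the fill value appropriate for its dtype.
fill_values = {col: "" for col in object_cols}
fill_values.update({col: 0 for col in float_cols})

# fillna accepts a dict keyed by column name and returns a new DataFrame.
filled = df.fillna(value=fill_values)
print(filled)
```

Columns that are not keys of the dictionary (here FundID) are passed through unchanged.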

A compact example:

# replace NaN with '' for columns of type 'object'
df = df.select_dtypes(include='object').fillna('')

However, after the above operation, the DataFrame will only contain the 'object' type columns. To keep all columns, use the solution proposed by @Ryan Saxe.

@Ryan Saxe's answer is accurate. To get it to work on my data, I had to make the fill persist (fillna returns a new object unless inplace=True) and pass the fill value explicitly as value=0 and value="". See code below:

import pandas as pd

for col in df:
    # get dtype for column
    dt = df[col].dtype
    # check if it is a number
    if pd.api.types.is_numeric_dtype(dt):
        # assign back; inplace=True on df[col] may not update df in newer pandas
        df[col] = df[col].fillna(value=0)
    elif pd.api.types.is_object_dtype(dt):
        df[col] = df[col].fillna(value="")

Rather than running the conversion one column at a time, which is inefficient, here is a way to grab all of the int or float columns and change them in one shot.

int_float_cols = df.select_dtypes(include=['int', 'float']).columns
df[int_float_cols] = df[int_float_cols].fillna(value=0)

It is obvious how to adapt this to handle object columns.

I'm aware that in older versions of Pandas no NAs were allowed in integer columns, so grabbing the "ints" is not strictly necessary, and it may accidentally promote ints to floats. However, in our use case, it is better to be safe than sorry.

I ran into this because the ordinary approach, df.fillna(0), corrupted all of the datetime variables.
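One way to avoid touching datetime columns is to fill by dtype group and leave everything else alone. The following is a minimal sketch on a made-up three-column frame (the data is hypothetical; the column names are borrowed from the question, and the string column is given an explicit object dtype as an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", None, "2024-01-03"]),
    "FundName": pd.Series(["A", None, "C"], dtype=object),
    "MTMPrice": [1.5, np.nan, 3.0],
})

# Select column names by dtype group; datetime columns match neither group.
num_cols = df.select_dtypes(include="number").columns
obj_cols = df.select_dtypes(include="object").columns

# Fill each group with its own value and write the result back.
df[num_cols] = df[num_cols].fillna(0)
df[obj_cols] = df[obj_cols].fillna("")

print(df.dtypes)
print(df)
```

The Date column keeps its datetime dtype and its missing entry stays NaT, which can then be handled separately if needed.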

Similar to @Guddi's answer: a bit verbose, but still more concise than @Ryan's answer, and it keeps all columns:

df[df.select_dtypes("object").columns] = df.select_dtypes("object").fillna("")
