
Casting NaN into int in a pandas Series

Originally posted 2014-10-28 19:48:13. Tags: python, numpy, pandas

I have missing values in a column of a DataFrame, so the command dataframe.colname.astype("int64") raises an error.

Any workarounds?

1 answer

The datatype, or dtype, of a pd.Series has very little impact on the actual way it is used.

You can have a pd.Series of integers and set its dtype to object. You can still do the same things with the pd.Series.

However, if you manually set the dtype of a pd.Series, pandas will start to cast the entries inside it. In my experience, this only leads to confusion.
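As a small illustration of that casting, here is a minimal sketch. The Series name and values are made up for the example; the point is only that an explicit astype call rewrites the stored values rather than just relabeling them.

```python
import pandas as pd

# Hypothetical data: floats with a fractional part.
prices = pd.Series([1.5, 2.7, 3.0])

# Forcing an integer dtype casts every entry; the fractional part is dropped.
as_int = prices.astype("int64")

# The cast changed the stored values, not just the label on the column.
values_changed = bool((as_int != prices).any())
```

Nothing warns about the lost fractions here, which is one way a manually chosen dtype can cause confusion later.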

Do not try to use dtypes the way you would use field types in a relational database. They are not the same thing.

If you want to have integers and NaNs/Nones mixed in a pd.Series, just set the dtype to object.
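A minimal sketch of the object-dtype approach, with made-up values, might look like this:

```python
import pandas as pd

# An object-dtype Series can hold Python ints next to None.
s = pd.Series([1, 2, None, 4], dtype=object)

# The integers stay integers; nothing is converted to float.
int_values = [v for v in s if isinstance(v, int)]

# Missing-value helpers such as isna() still work on object dtype.
n_missing = int(s.isna().sum())
```

The trade-off is that an object column stores generic Python objects, so numeric operations on it are typically slower than on a native int64 or float64 column.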

Setting the dtype to float will let you mix float representations of ints with NaNs. But remember that floats are prone to being inexact in their representation.
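The sketch below, using made-up values, shows both halves of that statement: a NaN pushes an integer column to float64 (which is why the cast in the question fails), and a large integer does not survive the trip through float.

```python
import numpy as np
import pandas as pd

# A NaN among integers makes pandas store the column as float64.
s = pd.Series([1, 2, np.nan])
stored_dtype = str(s.dtype)

# Casting back to int64 fails because NaN has no integer representation.
try:
    s.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Floats cannot represent every large integer exactly.
big = 2**53 + 1
as_float = pd.Series([big, np.nan])
round_tripped = int(as_float.iloc[0])
is_exact = round_tripped == big
```

For small integers the float representation is exact, so this only becomes a practical problem with very large values such as 64-bit identifiers.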

One common pitfall with dtypes that I should mention is the pd.merge operation, which can silently fail to match rows when the join keys have different dtypes, for example int vs. object, even if the object column contains only ints.
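What pandas does with mismatched key dtypes has varied between versions, so rather than rely on any particular behavior, one defensive option is to compare the key dtypes and align them explicitly before merging. The frames and column names below are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({
    "key": pd.Series([1, 2, 3], dtype=object),  # ints stored as objects
    "b": [10, 20, 30],
})

# Check the key dtypes before merging instead of trusting the result.
dtypes_match = left["key"].dtype == right["key"].dtype

# Align the right-hand key to the left-hand dtype, then merge.
right_fixed = right.assign(key=right["key"].astype(left["key"].dtype))
merged = left.merge(right_fixed, on="key", how="inner")
```

Comparing the number of rows in the merged frame against what you expect is a cheap additional sanity check.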

Other workarounds

You can use the Series.fillna method to fill your NaN values with a sentinel that is unlikely to occur in your data, such as 0 or -1.
Record where the NaNs were in a new column, df['was_nan'] = pd.isnull(df['floatcol']), and then use the Series.fillna method. This way you do not lose any information.
When calling the Series.astype() method, pass the keyword argument raise_on_error=False (renamed to errors="ignore" in later pandas versions), and just keep the current dtype if the cast fails. After all, dtypes do not matter that much.
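The three workarounds above can be sketched together as follows. The sentinel -1, the column name intcol, and the helper astype_or_keep are assumptions made for this example; the helper uses a plain try/except rather than the astype keyword argument, since that argument's name depends on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"floatcol": [1.0, np.nan, 3.0]})

# Remember which rows were missing before overwriting them.
df["was_nan"] = pd.isnull(df["floatcol"])

# Fill the gaps with a sentinel (-1 is an assumption) and cast to int.
df["intcol"] = df["floatcol"].fillna(-1).astype("int64")


def astype_or_keep(series, dtype):
    """Hypothetical helper: try the cast, keep the current dtype if it fails."""
    try:
        return series.astype(dtype)
    except (ValueError, TypeError):
        return series


# The NaN makes the cast fail, so the column stays float64.
kept = astype_or_keep(df["floatcol"], "int64")
```

Keeping the was_nan flag means the sentinel can later be told apart from a genuine -1 in the data.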

TL;DR

Don't focus on having the 'right dtype'; dtypes are strange. Focus on what you want the column to actually do. dtype=object is fine.