Pandas: ValueError: cannot convert float NaN to integer

I get ValueError: cannot convert float NaN to integer for the following:

df = pandas.read_csv('zoom11.csv')
df[['x']] = df[['x']].astype(int)
  • "x" is a column in the CSV file, but I cannot spot any float NaN in the file, and I don't understand what the error means.
  • When I read the column as a string, it has values like -1, 0, 1, ... 2000, which all look like perfectly good integers to me.
  • When I read the column as float, the file loads. It then shows values such as -1.0, 0.0, etc., and there are still no NaNs.
  • I tried error_bad_lines=False and the dtype parameter in read_csv, to no avail. Loading just aborts with the same exception.
  • The file is not small (10+ million rows), so I cannot inspect it manually. When I extract a small part from the top of the file there is no error, but it happens with the full file. So it is something in the file, but I cannot detect what.
  • Logically the CSV should not have missing values, but even if there is some garbage I would be fine with skipping those rows, or at least identifying them. However, I do not see a way to scan through the file and report conversion errors.

Update: Using the hints in the comments and answers, I got my data clean with this:

# x contained NaN
df = df[~df['x'].isnull()]

# Y contained some other garbage, so null check was not enough
df = df[df['y'].str.isnumeric()]

# final conversion now worked
df[['x']] = df[['x']].astype(int)
df[['y']] = df[['y']].astype(int)
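One caveat with the str.isnumeric() filter above: it only accepts strings made up entirely of numeric characters, so signed values such as -1 are treated as non-numeric and would be dropped along with the garbage. A minimal sketch comparing it with pd.to_numeric, using made-up sample values rather than the asker's data:

```python
import pandas as pd

# Hypothetical sample values standing in for a column read as strings.
y = pd.Series(['-1', '0', '12', 'abc'])

# str.isnumeric() is True only when every character is numeric,
# so the minus sign makes '-1' fail the check.
isnumeric_mask = y.str.isnumeric()

# to_numeric with errors='coerce' parses signed numbers and turns
# anything it cannot parse into NaN.
parsed = pd.to_numeric(y, errors='coerce')
parseable_mask = parsed.notna()

print(isnumeric_mask.tolist())  # [False, True, True, False]
print(parseable_mask.tolist())  # [True, True, True, False]
```

If the column can legitimately contain negative numbers, filtering on the to_numeric result is probably the safer choice.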

To identify NaN values, use boolean indexing:

print(df[df['x'].isnull()])

Then, to remove all non-numeric values, use to_numeric with the parameter errors='coerce', which replaces non-numeric values with NaN:

df['x'] = pd.to_numeric(df['x'], errors='coerce')

To remove all rows with NaN in column x, use dropna:

df = df.dropna(subset=['x'])

Finally, convert the values to int:

df['x'] = df['x'].astype(int)
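The steps above can also be combined to report which rows fail conversion, which is what the question asks for. The following is a sketch only: the inline CSV text is a hypothetical stand-in for the real file, and reading x as text first is an assumption made so that the original values can be shown.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: one empty value, one garbage value.
csv_text = "x,y\n-1,10\n,20\nabc,30\n2000,40\n"

# Read x as text so nothing is converted (or fails) during loading.
df = pd.read_csv(io.StringIO(csv_text), dtype={'x': str})

# Coerce to numbers; unparseable or missing entries become NaN.
x_num = pd.to_numeric(df['x'], errors='coerce')

# Rows whose x could not be converted, shown with their original text.
bad_rows = df[x_num.isna()]
print(bad_rows)

# Keep only the good rows and finish the integer conversion.
clean = df[x_num.notna()].copy()
clean['x'] = x_num[x_num.notna()].astype(int)
print(clean['x'].tolist())  # [-1, 2000]
```

On a file with millions of rows, bad_rows.index gives the positions of the offending records, so they can be inspected or logged before being dropped.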

ValueError: cannot convert float NaN to integer

From v0.24, you actually can. Pandas introduced Nullable Integer Data Types, which allow integers to coexist with NaNs.

Given a series of whole float numbers with missing data,

import numpy as np
import pandas as pd
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

s.dtype
# dtype('float64')

You can convert it to a nullable int type (choose from one of Int16, Int32, or Int64) with:

s2 = s.astype('Int32') # note the 'I' is uppercase
s2

0      1
1      2
2    NaN
3      4
dtype: Int32

s2.dtype
# Int32Dtype()

Your column needs to contain whole numbers for the cast to happen. Anything else will raise a TypeError:

s = pd.Series([1.1, 2.0, np.nan, 4.0])

s.astype('Int32')
# TypeError: cannot safely cast non-equivalent float64 to int32
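To avoid hitting that TypeError, it may help to check for fractional values before casting. The sketch below is one possible approach; rounding is an assumed policy chosen for illustration, and truncating or rejecting the data could be equally valid depending on the use case.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.1, 2.0, np.nan, 4.0])

# True when every non-missing value has no fractional part.
is_whole = bool((s.dropna() % 1 == 0).all())
print(is_whole)  # False

# One possible policy (an assumption, not the only choice): round first,
# then cast to a nullable integer type. The missing value stays missing.
s_int = s.round().astype('Int64')
print(s_int)
```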

Also, even in the latest versions of pandas, if the column has object dtype you have to convert it to float first, something like:

# np.float was removed in NumPy 1.24; the builtin float is equivalent here
df['column_name'].astype(float).astype("Int32")

NB: For some reason, you have to go through a float type first and only then to the nullable Int32.

Whether to use a 32-bit or 64-bit int depends on your data. Be aware that you may lose some precision if your numbers are too big for the format.
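As an alternative to astype(float) for object columns, pd.to_numeric can do the first step and additionally tolerate unparseable strings. This is a sketch with a hypothetical column, not part of the answer above:

```python
import pandas as pd

# Hypothetical object-dtype column mixing numeric strings and a missing value.
col = pd.Series(['1', '2', None, '4'], dtype=object)

# to_numeric yields float64 here because of the missing value...
as_float = pd.to_numeric(col, errors='coerce')

# ...and whole-number floats can then be cast to a nullable integer type.
as_int = as_float.astype('Int64')
print(as_int.dtype)  # Int64
```

Int64 is used here to leave the most headroom; a smaller type such as Int32 works the same way when the values are known to fit.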

I know this has been answered, but I wanted to provide an alternate solution for anyone in the future:

You can use .loc to subset the dataframe to only the rows where 'x' is notnull(), and then select only the 'x' column. Take that same vector and apply(int) to it.

If column x is float:

df.loc[df['x'].notnull(), 'x'] = df.loc[df['x'].notnull(), 'x'].apply(int)

If you have null values, you will get this error when doing math operations. To resolve it, use df[~df['x'].isnull()][['x']].astype(int) if you want to leave your dataset unchanged.
