How can I stop the pandas cut function from imputing NaN values?

Question

I'm trying to use the cut function to convert numeric data into categories. My input data may contain NaN values, which I would like to remain NaN after the cut. From my reading of the documentation, this is the default behavior and the following code should work:

import numpy as np
import pandas as pd

intervals = [(i, i + 1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan, 0.5, 10]), bins)

However, the output I get is:

>(49,50]
 (0,1]
 (9,10]

Notice that the NaN value is assigned to the middle interval.

Strangely, once the number of intervals is 100 or fewer, I get the desired output:

# Same as above, but with 100 intervals instead of 101
intervals = [(i, i + 1) for i in range(100)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan, 0.5, 10]), bins)

Output:

>NaN
 (0,1]
 (9,10]

Is there a way to specify that I don't want NaN values to be imputed?
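Since the behavior described above depends on the number of intervals, it may help to check it directly in a given environment. The following is a minimal sketch: the helper name nan_preserved is hypothetical, and it only checks that every missing input is still missing after the cut.

```python
import numpy as np
import pandas as pd


def nan_preserved(values: pd.Series, bins: pd.IntervalIndex) -> bool:
    """Return True if every NaN in `values` is still NaN after pd.cut."""
    binned = pd.cut(values, bins)
    # A row passes if it was not missing on input, or is still missing on output.
    return bool((binned.isna() | values.notna()).all())


# The 100-interval case, which the question reports as working.
intervals = [(i, i + 1) for i in range(100)]
bins = pd.IntervalIndex.from_tuples(intervals)
check = nan_preserved(pd.Series([np.nan, 0.5, 10]), bins)
```

Running the same check with range(101) shows whether the installed pandas and NumPy versions reproduce the problem.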

Answer 1

This seems to be a bug that originates in numpy.searchsorted():

pandas-dev/pandas#31586 - pd.cut returning incorrect output in some cases
numpy/numpy#15499 - BUG: searchsorted with object arrays containing nan

As a workaround, you could replace np.nan with some other placeholder that is guaranteed not to fall into any bin, e.g. .replace(np.nan, 'foo'):

import numpy as np
import pandas as pd
intervals = [(i, i + 1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan, 0.5, 10]).replace(np.nan, 'foo'), bins)

0            NaN
1     (0.0, 1.0]
2    (9.0, 10.0]
dtype: category
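A variation on the same idea, if you would rather not put a string placeholder into a numeric Series, is to keep the NaN rows away from pd.cut entirely and add them back afterwards. This is only a sketch: the helper name cut_preserving_nan is hypothetical, and it assumes the Series has a unique index so that reindex can restore the dropped rows.

```python
import numpy as np
import pandas as pd


def cut_preserving_nan(values: pd.Series, bins: pd.IntervalIndex) -> pd.Series:
    """Bin only the non-missing entries, then restore the original index."""
    present = values.notna()
    # NaN never reaches pd.cut, so it cannot be assigned to an interval.
    binned = pd.cut(values[present], bins)
    # Reindexing re-adds the dropped rows; they come back as NaN.
    # Assumes values.index is unique.
    return binned.reindex(values.index)


intervals = [(i, i + 1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
result = cut_preserving_nan(pd.Series([np.nan, 0.5, 10]), bins)
```

The result keeps the categorical dtype, with the first row left as NaN regardless of how many intervals are in bins.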

Answer 1: accepted, 2021-05-05 22:13:17
