How to not impute NaN values with pandas cut function?

I'm trying to use the cut function to convert numeric data into categories. My input data may have NaN values, which I would like to stay NaN after the cut. From what I understand from reading the documentation, this is the default behavior, and the following code should work:

import numpy as np
import pandas as pd

intervals = [(i, i + 1) for i in range(101)]  # 101 intervals: (0, 1], ..., (100, 101]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan, 0.5, 10]), bins)

However, the output I get is:

>(49,50]
 (0,1]
 (9,10]

Notice that the NaN value is assigned to the middle interval instead of staying NaN.

One strange thing is that once the number of intervals is 100 or fewer, I get the desired output:

intervals = [(i, i+1) for i in range(100)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]),bins)

Output:

>NaN
 (0,1]
 (9,10]

Is there a way to specify that I don't want NaN values to be imputed?

This seems like a bug that originates from numpy.searchsorted().

As a workaround, you could replace np.nan with some other value that is guaranteed not to fall into any interval, e.g. .replace(np.nan,'foo'):

intervals = [(i, i+1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]).replace(np.nan,'foo'),bins)

0            NaN
1     (0.0, 1.0]
2    (9.0, 10.0]
dtype: category
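
If you would rather not put a placeholder string into a numeric Series, another option is to bin first and then mask the result wherever the input was NaN. This is only a sketch: the names values and binned are made up for illustration, and it relies on Series.where() keeping the categorical dtype when it fills with NaN.

```python
import numpy as np
import pandas as pd

# 101 intervals, the case that misbehaved in the question
bins = pd.IntervalIndex.from_tuples([(i, i + 1) for i in range(101)])
values = pd.Series([np.nan, 0.5, 10])

# Bin everything, then reset every position whose input was NaN back to NaN,
# regardless of which interval pd.cut assigned to it.
binned = pd.cut(values, bins).where(values.notna())
print(binned)
```

Because the mask is taken from the original input, this does not depend on how pd.cut itself handles NaN for a given number of intervals.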
