pandas cut preserves NaNs in groupby for bins that contain no values

I am getting strange behaviour from the pandas cut function. Suppose I have this dataframe:

import pandas as pd
df = pd.DataFrame([1, 4, 8, 9], columns=['A'])

and I want to bin these values:

bins = list(range(0, 10))

As expected, I get this:

df['binned'] = pd.cut(df['A'], bins=bins)

print(df)

A  binned
1  (0, 1]
4  (3, 4]
8  (7, 8]
9  (8, 9]

So far, all good. But when I group by the binned column, extra NaN rows suddenly appear.

df = df.groupby('binned', as_index=False).max()
print(df)

binned    A
(0, 1]  1.0
(1, 2]  NaN
(2, 3]  NaN
(3, 4]  4.0
(4, 5]  NaN
(5, 6]  NaN
(6, 7]  NaN
(7, 8]  8.0
(8, 9]  9.0

Why were those NaN bins preserved? If they were there from the beginning, why do they show up only in the groupby result and not before?

If this is expected behaviour, how can I remove those NaNs before calling groupby?

I even called dropna before the groupby, but that doesn't help: there are no NaN rows at that point, so it does nothing.

You need to set observed=True, because your 'binned' column contains categorical values. In categorical data, all categories are preserved, whether or not they actually occur in the column.

df.groupby('binned', as_index=False, observed=True).max()

As you can see when you check df['binned'].dtype, the type is: CategoricalDtype(categories=[(0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, 9]], ordered=True)

So this is where the information is preserved: not in the values, but in the dtype of the column. That is also why dropna has no effect; the empty bins are not rows.
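To make the point about the dtype concrete, here is a minimal sketch that compares the number of categories stored in the dtype with the number of bins that actually occur. It also shows one possible way to drop the empty bins before grouping, using the `cat.remove_unused_categories()` accessor method. This is an alternative to the observed=True approach, not part of the original answer, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([1, 4, 8, 9], columns=["A"])
df["binned"] = pd.cut(df["A"], bins=list(range(0, 10)))

# The dtype lists every bin produced by pd.cut; the values use only some.
n_categories = len(df["binned"].cat.categories)  # bins defined in the dtype
n_observed = df["binned"].nunique()              # bins that occur in the data

# Alternative (assumption: you want the dtype itself trimmed):
# drop the unused bins so that groupby has nothing extra to show.
trimmed = df.assign(binned=df["binned"].cat.remove_unused_categories())
result = trimmed.groupby("binned", observed=False)["A"].max()

print(n_categories, n_observed, len(result))
```

After trimming, even a groupby that shows all categories returns only the four populated bins, because the empty ones no longer exist in the dtype.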

From the documentation on groupby:

observed: bool, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
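The difference between the two settings can be sketched by counting rows per bin with each one. The example passes observed explicitly in both calls, so it does not rely on the library default; that default may differ between pandas versions, so it is worth checking against the version you have installed. The names `all_bins` and `seen_bins` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([1, 4, 8, 9], columns=["A"])
df["binned"] = pd.cut(df["A"], bins=list(range(0, 10)))

# observed=False: one group per category, including bins with no rows.
all_bins = df.groupby("binned", observed=False)["A"].size()

# observed=True: only the bins that actually occur in the data.
seen_bins = df.groupby("binned", observed=True)["A"].size()

print(len(all_bins), len(seen_bins))
print(int((all_bins == 0).sum()), "empty bins are dropped by observed=True")
```

With an aggregation such as max(), those empty groups are what show up as NaN in the question's output.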
