Using pandas.cut to bin a continuous variable, but it is skipping values?

Question

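I am trying to use pandas.cut to bin some continuous data, "capital-gain".

Could anyone suggest why some capital-gain data points are not being binned as expected? For example, the 159 rows with the value 99999 do not fall into the [45000, 110000) bin.

I am using the Adult dataset from here: https://archive.ics.uci.edu/ml/datasets/adult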

df = pd.read_csv(...)
df['capital-gain'] = df['capital-gain'].replace(pd.cut(df['capital-gain'], [0,45000,110000,150000], right=False, include_lowest=True))
df['capital-gain'].value_counts()

The output is:

[0, 45000)         32387
99999                159  #why is this here, and not falling into the group below?
[45000, 110000)        8
34095                  5
41310                  2
Name: capital-gain, dtype: int64

I have checked that the values in this field contain no whitespace.

Thanks in advance to anyone who has time to reply.

Answer 1

You don't need to replace the values; just assign the result of pd.cut() directly:

df['capital-gain'] = pd.cut(df['capital-gain'], [0,45000,110000,150000], right=False, include_lowest=True)
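To see the direct assignment work without downloading the dataset, here is a minimal sketch on a small hand-made Series. The sample values are made up for illustration (only 34095, 41310 and 99999 are taken from the question's output), and the variable names `gains`, `bins` and `binned` are hypothetical. A likely reason the original `replace` call misbehaved is that a Series passed to `replace` is treated like a dict keyed by its index labels rather than matched row by row, so values such as 99999 that are not index labels would be left untouched; treat that as an assumption, not a confirmed diagnosis.

```python
import pandas as pd

# Made-up sample; 34095, 41310 and 99999 are the values from the question.
gains = pd.Series([0, 2174, 34095, 41310, 99999, 99999], name="capital-gain")

# Same bin edges as in the question; right=False gives left-closed bins [a, b).
bins = [0, 45000, 110000, 150000]

# Assign the result of pd.cut directly instead of passing it to replace().
binned = pd.cut(gains, bins=bins, right=False)

# Each row now carries its interval label; 99999 lands in [45000, 110000).
print(binned)
print(binned.value_counts())
```

Note that any value outside the outermost edges (below 0, or 150000 and above with right=False) would come back as NaN, so it may be worth checking `binned.isna().sum()` on the real data.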
