Using pandas.cut to split continuous variable, but it's skipping values?

Question

I'm trying to use pandas.cut to categorise some continuous data, "capital gains".

Could someone please advise why some of the capital gains data points are not being binned? For example, the 159 rows with the value 99999 are not falling into the [45000, 110000) bracket.

I'm using the adult data set from this location: https://archive.ics.uci.edu/ml/datasets/adult

df = pd.read_csv(...)
df['capital-gain'] = df['capital-gain'].replace(pd.cut(df['capital-gain'], [0,45000,110000,150000], right=False, include_lowest=True))
df['capital-gain'].value_counts()

The output is:

[0, 45000)         32387
99999                159  #why is this here, and not falling into the group below?
[45000, 110000)        8
34095                  5
41310                  2
Name: capital-gain, dtype: int64

I have checked that data points do not have spaces in the field.

Thank you in advance to anyone who has time to respond.

Answer 1

You don't need to replace the values; just assign the result of pd.cut() directly.

df['capital-gain'] = pd.cut(df['capital-gain'], [0,45000,110000,150000], right=False, include_lowest=True)
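A likely reason the original approach skipped values (this is my reading of the output, not something stated in the question): when Series.replace is given a Series, pandas appears to treat it like a mapping from index labels to replacement values, so it matches values against row labels instead of replacing row by row. Values larger than the last row label, such as 99999, 34095 and 41310, have nothing to match and are left as they are. Below is a minimal sketch of the direct assignment on a small, made-up sample; the sample values and the 'capital-gain-bin' column name are assumptions for illustration only, not taken from the real data set.

```python
import pandas as pd

# Hypothetical sample values standing in for the real 'capital-gain' column.
df = pd.DataFrame({"capital-gain": [0, 2174, 34095, 41310, 99999, 99999]})

# Same bin edges as in the question; right=False makes each bin closed on the left.
bins = [0, 45000, 110000, 150000]

# Assign the result of pd.cut directly instead of passing it through replace().
# 'capital-gain-bin' is an illustrative column name so the raw values are kept.
df["capital-gain-bin"] = pd.cut(df["capital-gain"], bins, right=False, include_lowest=True)

print(df["capital-gain-bin"].value_counts())
```

Writing the bins to a separate column keeps the original numbers available for checking; overwriting df['capital-gain'] as in the answer works the same way.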

1 answer

solution1
1 ACCEPTED 2021-06-27 03:31:49