Using pandas.cut to split continuous variable, but it's skipping values?

I'm trying to use pandas.cut to categorise some continuous data - "capital gains".

Could someone please advise why some of the capital gains data points are not being binned as expected? For example, the 159 occurrences of 99999 are not falling into the [45000, 110000) bracket.

I'm using the Adult data set from https://archive.ics.uci.edu/ml/datasets/adult

df = pd.read_csv(...)
df['capital-gain'] = df['capital-gain'].replace(pd.cut(df['capital-gain'], [0,45000,110000,150000], right=False, include_lowest=True))
df['capital-gain'].value_counts()

The output is:

[0, 45000)         32387
99999                159  #why is this here, and not falling into the group below?
[45000, 110000)        8
34095                  5
41310                  2
Name: capital-gain, dtype: int64

I have checked that data points do not have spaces in the field.

Thank you in advance to anyone who has time to respond.

You don't need to replace the values; just assign the result of pd.cut() directly:

df['capital-gain'] = pd.cut(df['capital-gain'], [0,45000,110000,150000], right=False, include_lowest=True)
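To illustrate, here is a minimal sketch on a small made-up sample (the six values below are hypothetical stand-ins, not the real Adult data). As far as I can tell, the original problem comes from replace() treating the Series returned by pd.cut() as a mapping from index label to bin, so only values that happen to equal a row label get swapped and values such as 99999 are left alone. Assigning the pd.cut() result directly avoids that.

```python
import pandas as pd

# Hypothetical sample standing in for the real 'capital-gain' column.
df = pd.DataFrame({"capital-gain": [0, 2174, 34095, 41310, 99999, 99999]})

# Same bin edges as in the question; right=False gives left-closed
# intervals: [0, 45000), [45000, 110000), [110000, 150000).
bins = [0, 45000, 110000, 150000]

# Assign the binned result directly instead of passing it to replace().
binned = pd.cut(df["capital-gain"], bins, right=False)
df["capital-gain-bin"] = binned

# Every row now carries an interval label; none keep their raw number.
print(df)
print(binned.value_counts(sort=False))
```

Writing the bins to a new column (the name capital-gain-bin is just an illustration) keeps the raw numbers around, which makes it easy to check that each value landed in the bracket you expected.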
