I have tried to find an answer to my question, but maybe I'm just not applying the solutions correctly to my situation. This is what I wrote to group some rows of my dataframe into income groups: I created four new dataframes, applied an index to each, and then concatenated them. Is this optimal, or is there a better way to do it?
I should add that my goal is to create a boxplot from these new groups using the boxplot "by=" argument.
df_nonull1 = df_nonull[(df_nonull['mn_earn_wne_p6'] < 20000)]
df_nonull2 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 20000) & (df_nonull['mn_earn_wne_p6'] < 30000)]
df_nonull3 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 30000) & (df_nonull['mn_earn_wne_p6'] < 40000)]
df_nonull4 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 40000)]
df_nonull1['inc_index'] = 1
df_nonull2['inc_index'] = 2
df_nonull3['inc_index'] = 3
df_nonull4['inc_index'] = 4
frames = [df_nonull1,df_nonull2,df_nonull3,df_nonull4]
results = pd.concat(frames)
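For reference, once any frame carries an inc_index column, the stated boxplot goal is just the by= argument. A minimal sketch with made-up numbers standing in for the concatenated results frame (column names from the question):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the example runs without a display
import pandas as pd

# Hypothetical stand-in for the concatenated `results` frame
results = pd.DataFrame({
    'mn_earn_wne_p6': [15000, 25000, 35000, 45000, 18000, 32000],
    'inc_index':      [1,     2,     3,     4,     1,     3],
})

# Boxplot of earnings grouped by the new income index
ax = results.boxplot(column='mn_earn_wne_p6', by='inc_index')
```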
If all your values are between 10k and 50k, you can assign your index using integer division (//):
df_nonull['inc_index'] = df_nonull.mn_earn_wne_p6 // 10000
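For example, with a few made-up earnings values (a sketch, not the asker's data):

```python
import pandas as pd

# Made-up earnings between 10k and 50k (column name from the question)
s = pd.Series([12000, 25000, 33000, 48000], name='mn_earn_wne_p6')
print((s // 10000).tolist())  # floor division: each value maps to its 10k bucket
# -> [1, 2, 3, 4]
```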
You don't need to break up your dataframes and concatenate them; you need a way to build your inc_index column from the mn_earn_wne_p6 field.
Edit. As Paul mentioned in the comments, there is a pd.cut function for exactly this sort of thing, which is much more elegant than my original answer.
# equal-width bins
df['inc_index'] = pd.cut(df.A, bins=4, labels=[1, 2, 3, 4])
# custom bin edges
df['inc_index'] = pd.cut(df.A, bins=[0, 20000, 30000, 40000, 50000],
                         labels=[1, 2, 3, 4])
Note that the labels argument is optional. pd.cut produces an ordered categorical Series, so you can sort by the resulting column regardless of labels:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 20, (10, 2)), columns=list('AB'))
df['inc_index'] = pd.cut(df.A, bins=[0, 7, 13, 15, 20])
print(df.sort_values('inc_index'))
which outputs (modulo random numbers)
A B inc_index
6 2 16 (0, 7]
7 5 5 (0, 7]
3 12 6 (7, 13]
4 10 8 (7, 13]
5 9 13 (7, 13]
1 15 10 (13, 15]
2 15 7 (13, 15]
8 15 13 (13, 15]
0 18 10 (15, 20]
9 16 12 (15, 20]
Original solution. This is a generalization of Alexander's answer to variable bucket widths. You can build the inc_index column using Series.apply. For example,
def bucket(v):
    # the thresholds can be arbitrary
    if v < 20000:
        return 1
    if v < 30000:
        return 2
    if v < 40000:
        return 3
    return 4

df['inc_index'] = df.mn_earn_wne_p6.apply(bucket)
or, if you really want to avoid a def,
df['inc_index'] = df.mn_earn_wne_p6.apply(
    lambda v: 1 if v < 20000 else 2 if v < 30000 else 3 if v < 40000 else 4)
Note that if you just want to subdivide the range of mn_earn_wne_p6 into equal buckets, then Alexander's way is much cleaner and faster:
df['inc_index'] = df.mn_earn_wne_p6 // bucket_width
Then, to produce the result you want, you can just sort by this column.
df.sort_values('inc_index')
You can also groupby('inc_index') to aggregate results within each bucket.
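For instance, a small sketch reusing the style of the pd.cut example above (made-up numbers; observed=True restricts the groups to bins that actually occur):

```python
import pandas as pd

# Small made-up frame mirroring the pd.cut example above
df = pd.DataFrame({'A': [2, 5, 12, 10, 15, 18],
                   'B': [16, 5, 6, 8, 10, 12]})
df['inc_index'] = pd.cut(df.A, bins=[0, 7, 13, 20])

# Mean of B within each A-bucket
print(df.groupby('inc_index', observed=True)['B'].mean())
```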