Python/Pandas - Best way to group by criteria?

Question

I have tried to find an answer to my question, but maybe I'm just not applying the solutions correctly to my situation. This is what I created to group some rows in my datasheet into income groups. I created 4 new dataframes and then concatenated them after applying an index to each. Is this optimal or is there a better way to do things?

I should add that my goal is to create a boxplot from these new groups using the boxplot "by=" argument.

df_nonull1 = df_nonull[(df_nonull['mn_earn_wne_p6'] < 20000)]
df_nonull2 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 20000) & (df_nonull['mn_earn_wne_p6'] < 30000)]
df_nonull3 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 30000) & (df_nonull['mn_earn_wne_p6'] < 40000)]
df_nonull4 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 40000)]

df_nonull1['inc_index'] = 1
df_nonull2['inc_index'] = 2
df_nonull3['inc_index'] = 3
df_nonull4['inc_index'] = 4
frames = [df_nonull1,df_nonull2,df_nonull3,df_nonull4]
results = pd.concat(frames)

Answer 1

If all your values are between 10k and 50k, you can assign your index using integer division (//):

df_nonull['inc_index'] = df_nonull.mn_earn_wne_p6 // 10000

You don't need to break up your dataframe and concatenate the pieces; you only need a way to derive inc_index from your mn_earn_wne_p6 field.
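If some values fall below 10k or above 50k, plain integer division will produce indexes outside 1 to 4. A minimal sketch of one way to handle that, assuming the four groups from the question and some made-up sample earnings, is to clamp the result with Series.clip:

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df_nonull = pd.DataFrame(
    {"mn_earn_wne_p6": [5000, 15000, 20000, 25000, 35000, 40000, 60000]}
)

# Integer-divide by the bucket width, then clamp so that everything
# below 20000 lands in group 1 and everything from 40000 up lands in group 4.
df_nonull["inc_index"] = (df_nonull["mn_earn_wne_p6"] // 10000).clip(lower=1, upper=4)

print(df_nonull)
```

This only works because the question's inner buckets are all 10000 wide; unequal widths need a different approach.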

Answer 2

Edit. As Paul mentioned in the comments, there is a pd.cut function for exactly this sort of thing, which is much more elegant than my original answer.

# equal-width bins
df['inc_index'] = pd.cut(df.A, bins=4, labels=[1, 2, 3, 4])

# custom bin edges
df['inc_index'] = pd.cut(df.A, bins=[0, 20000, 30000, 40000, 50000],
                         labels=[1, 2, 3, 4])

Note that the labels argument is optional. pd.cut produces an ordered categorical Series, so you can sort by the resulting column regardless of labels:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 20, (10, 2)), columns=list('AB'))
df['inc_index'] = pd.cut(df.A, bins=[0, 7, 13, 15, 20])
print(df.sort_values('inc_index'))

which outputs (modulo random numbers)

    A   B inc_index
6   2  16    (0, 7]
7   5   5    (0, 7]
3  12   6   (7, 13]
4  10   8   (7, 13]
5   9  13   (7, 13]
1  15  10  (13, 15]
2  15   7  (13, 15]
8  15  13  (13, 15]
0  18  10  (15, 20]
9  16  12  (15, 20]
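One detail worth checking: as the output shows, pd.cut closes each interval on the right by default, so a value of exactly 20000 would fall in the first bin, whereas the question's filters put it in the second. The sketch below is one possible way to mirror the question's thresholds, using right=False and open-ended outer edges. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical earnings, chosen to sit on and around the thresholds.
earn = pd.Series([5000, 19999, 20000, 29999, 30000, 40000, 75000])

# right=False makes each bin closed on the left: [20000, 30000), etc.
# The infinite outer edges catch values below 20000 and at or above 40000.
inc_index = pd.cut(
    earn,
    bins=[-np.inf, 20000, 30000, 40000, np.inf],
    right=False,
    labels=[1, 2, 3, 4],
)

print(pd.DataFrame({"earn": earn, "inc_index": inc_index}))
```

Without the infinite edges, any value outside the listed bin range would come back as NaN.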

Original solution. This is a generalization of Alexander's answer to variable bucket widths. You can build the inc_index column using Series.apply. For example,

def bucket(v):
    # of course, the thresholds can be arbitrary
    if v < 20000:
        return 1
    if v < 30000:
        return 2
    if v < 40000:
        return 3
    return 4

df['inc_index'] = df.mn_earn_wne_p6.apply(bucket)

or, if you really want to avoid a def,

df['inc_index'] = df.mn_earn_wne_p6.apply(
    lambda v: 1 if v < 20000 else 2 if v < 30000 else 3 if v < 40000 else 4)
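Series.apply calls the Python function once per row, which can be slow on large frames. As an alternative that is not part of the original answer, np.digitize can apply the same thresholds in one vectorized call. This is a sketch with made-up sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({"mn_earn_wne_p6": [12000, 20000, 29999, 30000, 45000]})

# np.digitize returns how many thresholds each value has reached
# (0 for values below 20000), so adding 1 gives the groups 1 to 4.
thresholds = [20000, 30000, 40000]
df["inc_index"] = np.digitize(df["mn_earn_wne_p6"].to_numpy(), thresholds) + 1

print(df)
```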

Note that if you just want to subdivide the range of mn_earn_wne_p6 into equal buckets, then Alexander's way is much cleaner and faster.

df['inc_index'] = df.mn_earn_wne_p6 // bucket_width

Then, to produce the result you want, you can just sort by this column.

df.sort_values('inc_index')

You can also groupby('inc_index') to aggregate results within each bucket.
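As a rough illustration of that aggregation step, here is a sketch that assumes inc_index has already been computed. The sample values are invented.

```python
import pandas as pd

# Hypothetical data with the bucket column already assigned.
df = pd.DataFrame({
    "mn_earn_wne_p6": [15000, 18000, 25000, 32000, 38000, 45000],
    "inc_index": [1, 1, 2, 3, 3, 4],
})

# Count the rows and average the earnings within each income bucket.
summary = df.groupby("inc_index")["mn_earn_wne_p6"].agg(["count", "mean"])

print(summary)
```

For the boxplot mentioned in the question, the same column can be passed as the grouping key, e.g. df.boxplot(column='mn_earn_wne_p6', by='inc_index'), which requires matplotlib to be installed.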

2 answers

solution1
2 2016-03-31 22:56:23

solution2
1 ACCEPTED 2016-03-31 23:16:01
