
How do I make bins with an equal number of observations in a pandas dataframe?

I'm trying to make a column in a dataframe depicting the group or bin that each observation belongs to. The idea is to sort the dataframe according to some column, then add another column denoting which bin each observation falls into. If I want deciles, then I should be able to tell a function I want 10 equal (or close to equal) groups.

I tried pandas qcut, but that just gives tuples of the upper and lower limits of the bins. I would like just 1, 2, 3, 4, ... etc. Take the following for example:

import numpy as np
import pandas as pd

x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))

df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)

This gives a df of 14 observations. How could I split it into 5 (close to) equal groups?

The desired result would be the following:

        x         y  group
0       1  0.926273      1
1       2  0.678101      1
2       3  0.636875      1
3       4  0.802590      2
4       5  0.494553      2
5       6  0.874876      2
6       7  0.607902      3
7       8  0.028737      3
8       5  0.493545      3
9      45  0.498140      4
10  64545  0.938377      4
11     65  0.613015      4
12   6456  0.288266      5
13    564  0.917817      5
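
As a side note on the qcut attempt: passing labels=False makes pd.qcut return plain integer codes instead of interval edges. Applied to the row position rather than to x, it gives nearly equal-sized groups, though the slack rows may land in different bins than in the splits shown below. A minimal sketch:

# bin the positional index into 5 quantile-based chunks; labels=False
# returns integer codes 0..4, so add 1 to get groups 1..5
df['group'] = pd.qcut(np.arange(len(df)), 5, labels=False) + 1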

Group by every N rows, then use ngroup:

df['group'] = df.groupby(np.arange(len(df.index)) // 3, axis=0).ngroup() + 1



        x         y  group
0       1  0.548801      1
1       2  0.096620      1
2       3  0.713771      1
3       4  0.922987      2
4       5  0.283689      2
5       6  0.807755      2
6       7  0.592864      3
7       8  0.670315      3
8       5  0.034549      3
9      45  0.355274      4
10  64545  0.239373      4
11     65  0.156208      4
12   6456  0.419990      5
13    564  0.248278      5
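
The // 3 above is the chunk size for this particular frame (14 rows split into groups of 3). To derive it from a desired number of bins instead of hard-coding it, a small sketch (assuming a bins variable; axis=0 is the default and can be dropped):

bins = 5
size = int(np.ceil(len(df) / bins))   # rows per group, rounded up
df['group'] = df.groupby(np.arange(len(df)) // size).ngroup() + 1

With 14 rows and 5 bins this gives group sizes of 3, 3, 3, 3 and 2, the last group absorbing the shortfall.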

You can split evenly with np.array_split(), assign the groups, then recombine with pd.concat():

bins = 5
splits = np.array_split(df, bins)

for i in range(len(splits)):
    splits[i]['group'] = i + 1

df = pd.concat(splits)

Or as a one-liner using assign():

df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])

        x         y  group
0       1  0.145781      1
1       2  0.262097      1
2       3  0.114799      1
3       4  0.275054      2
4       5  0.841606      2
5       6  0.187210      2
6       7  0.582487      3
7       8  0.019881      3
8       5  0.847115      3
9      45  0.755606      4
10  64545  0.196705      4
11     65  0.688639      4
12   6456  0.275884      5
13    564  0.579946      5
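
Depending on the pandas version, assigning a column to the pieces returned by np.array_split can raise a SettingWithCopyWarning, because they may be views of the original frame. Taking explicit copies is a small variation on the loop above that avoids it:

bins = 5
splits = [s.copy() for s in np.array_split(df, bins)]   # independent copies, not views
for i, s in enumerate(splits):
    s['group'] = i + 1
df = pd.concat(splits)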

Another option is to generate the list of bin sizes with a near_split helper, then expand it into group labels:

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)

Output:

        x         y  group
0       1  0.313614      1
1       2  0.765079      1
2       3  0.153851      1
3       4  0.792098      2
4       5  0.123700      2
5       6  0.239107      2
6       7  0.133665      3
7       8  0.979318      3
8       5  0.781948      3
9      45  0.264344      4
10  64545  0.495561      4
11     65  0.504734      4
12   6456  0.766627      5
13    564  0.428423      5
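
To make the mechanics concrete, this is what near_split returns for the 14-row example and how the comprehension expands it (illustrative values only):

near_split(14, 5)
# -> [3, 3, 3, 3, 2]
# expanded by the comprehension into
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]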

Here is an approach that "manually" computes the extent of the bins, based on the requested number of bins:

bins = 5

l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1

df['group'] = group

Result:

        x         y  group
0       1  0.205168      1
1       2  0.105466      1
2       3  0.545794      1
3       4  0.639346      2
4       5  0.758056      2
5       6  0.982090      2
6       7  0.942849      3
7       8  0.284520      3
8       5  0.491151      3
9      45  0.731265      4
10  64545  0.072668      4
11     65  0.601416      4
12   6456  0.239454      5
13    564  0.345006      5

This seems to follow the splitting logic of np.array_split (i.e. try to split the rows evenly, but add the leftover rows onto the earlier bins when an even split isn't possible).
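
Traced on the 14-row example, the intermediate values are (illustrative only):

# l = 14, bins = 5
# minbinlen = 14 // 5 = 2, remainder = 14 % 5 = 4
# repeats after the += 1          -> [3, 3, 3, 3, 2]
# np.repeat(range(5), repeats) + 1 -> [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]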

While the code is less concise, it doesn't use any loops, so it theoretically should be faster with larger amounts of data.

Just because I was curious, going to leave this perfplot testing here...

[perfplot benchmark results plot]

import numpy as np
import pandas as pd
import perfplot

def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)

    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1

    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)

    for i in range(len(splits)):
        splits[i]['group'] = i + 1

    return pd.concat(splits)

perfplot.show(
    setup = lambda n : make_data(n),
    kernels = [
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index))//3,axis=0).ngroup()+1,
        lambda df: array_split(df)
        ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)
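
(Presumably equality_check=None is needed here because the kernels return different kinds of objects: a NumPy array, a plain list, a pandas Series, and a full DataFrame, so perfplot's default output comparison would not apply.)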
