
How do I make bins with an equal number of observations in a pandas dataframe?

I'm trying to make a column in a dataframe depicting the group or bin that each observation belongs to. The idea is to sort the dataframe according to some column, then add another column denoting which bin each observation falls into. If I want deciles, then I should be able to tell a function I want 10 equal (or close to equal) groups.

I tried pandas qcut, but that just gives tuples of the upper and lower limits of the bins. I would like just 1, 2, 3, 4, ... etc. Take the following for example:

import numpy as np
import pandas as pd

x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))

df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)

This gives a df of 14 observations. How could I split it into 5 (close to) equal groups?

The desired result would be the following:

        x         y  group
0       1  0.926273      1
1       2  0.678101      1
2       3  0.636875      1
3       4  0.802590      2
4       5  0.494553      2
5       6  0.874876      2
6       7  0.607902      3
7       8  0.028737      3
8       5  0.493545      3
9      45  0.498140      4
10  64545  0.938377      4
11     65  0.613015      4
12   6456  0.288266      5
13    564  0.917817      5
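
As a side note on the qcut attempt: passing labels=False makes pd.qcut return plain integer codes instead of interval edges. Applied to the row position rather than to x, it gives nearly equal-sized groups, though the slack rows may land in different bins than in the splits shown below. A minimal sketch:

# bin the positional index into 5 quantile-based chunks; labels=False
# returns integer codes 0..4, so add 1 to get groups 1..5
df['group'] = pd.qcut(np.arange(len(df)), 5, labels=False) + 1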

Group by every N rows, then use ngroup:

df['group'] = df.groupby(np.arange(len(df.index)) // 3, axis=0).ngroup() + 1



        x         y  group
0       1  0.548801      1
1       2  0.096620      1
2       3  0.713771      1
3       4  0.922987      2
4       5  0.283689      2
5       6  0.807755      2
6       7  0.592864      3
7       8  0.670315      3
8       5  0.034549      3
9      45  0.355274      4
10  64545  0.239373      4
11     65  0.156208      4
12   6456  0.419990      5
13    564  0.248278      5
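
The // 3 above is the chunk size for this particular frame (14 rows split into groups of 3). To derive it from a desired number of bins instead of hard-coding it, a small sketch (assuming a bins variable; axis=0 is the default and can be dropped):

bins = 5
size = int(np.ceil(len(df) / bins))   # rows per group, rounded up
df['group'] = df.groupby(np.arange(len(df)) // size).ngroup() + 1

With 14 rows and 5 bins this gives group sizes of 3, 3, 3, 3 and 2, the last group absorbing the shortfall.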

You can split evenly with np.array_split(), assign the groups, then recombine with pd.concat():

bins = 5
splits = np.array_split(df, bins)

for i in range(len(splits)):
    splits[i]['group'] = i + 1

df = pd.concat(splits)

Or as a one-liner using assign():

df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])

        x         y  group
0       1  0.145781      1
1       2  0.262097      1
2       3  0.114799      1
3       4  0.275054      2
4       5  0.841606      2
5       6  0.187210      2
6       7  0.582487      3
7       8  0.019881      3
8       5  0.847115      3
9      45  0.755606      4
10  64545  0.196705      4
11     65  0.688639      4
12   6456  0.275884      5
13    564  0.579946      5
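
Depending on the pandas version, assigning a column to the pieces returned by np.array_split can raise a SettingWithCopyWarning, because they may be views of the original frame. Taking explicit copies is a small variation on the loop above that avoids it:

bins = 5
splits = [s.copy() for s in np.array_split(df, bins)]   # independent copies, not views
for i, s in enumerate(splits):
    s['group'] = i + 1
df = pd.concat(splits)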

Another option is to generate the list of bin sizes with a near_split helper, then expand it into group labels:

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)

Output:

        x         y  group
0       1  0.313614      1
1       2  0.765079      1
2       3  0.153851      1
3       4  0.792098      2
4       5  0.123700      2
5       6  0.239107      2
6       7  0.133665      3
7       8  0.979318      3
8       5  0.781948      3
9      45  0.264344      4
10  64545  0.495561      4
11     65  0.504734      4
12   6456  0.766627      5
13    564  0.428423      5
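
To make the mechanics concrete, this is what near_split returns for the 14-row example and how the comprehension expands it (illustrative values only):

near_split(14, 5)
# -> [3, 3, 3, 3, 2]
# expanded by the comprehension into
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]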

Here is an approach that "manually" computes the extent of the bins, based on the requested number of bins:

bins = 5

l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1

df['group'] = group

Result:

        x         y  group
0       1  0.205168      1
1       2  0.105466      1
2       3  0.545794      1
3       4  0.639346      2
4       5  0.758056      2
5       6  0.982090      2
6       7  0.942849      3
7       8  0.284520      3
8       5  0.491151      3
9      45  0.731265      4
10  64545  0.072668      4
11     65  0.601416      4
12   6456  0.239454      5
13    564  0.345006      5

This seems to follow the splitting logic of np.array_split (i.e. try to split the rows evenly, but add the leftover rows onto the earlier bins when an even split isn't possible).
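
Traced on the 14-row example, the intermediate values are (illustrative only):

# l = 14, bins = 5
# minbinlen = 14 // 5 = 2, remainder = 14 % 5 = 4
# repeats after the += 1          -> [3, 3, 3, 3, 2]
# np.repeat(range(5), repeats) + 1 -> [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]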

While the code is less concise, it doesn't use any loops, so it theoretically should be faster with larger amounts of data.

Just because I was curious, going to leave this perfplot testing here...

[perfplot benchmark results plot]

import numpy as np
import pandas as pd
import perfplot

def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)

    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1

    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)

    for i in range(len(splits)):
        splits[i]['group'] = i + 1

    return pd.concat(splits)

perfplot.show(
    setup = lambda n : make_data(n),
    kernels = [
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index))//3,axis=0).ngroup()+1,
        lambda df: array_split(df)
        ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)
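
(Presumably equality_check=None is needed here because the kernels return different kinds of objects: a NumPy array, a plain list, a pandas Series, and a full DataFrame, so perfplot's default output comparison would not apply.)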
