Grouping data in a dataframe by multiple columns (including an auto-generated column)

Question

I am trying to create an index from two columns in a pandas dataframe. However, I first want to 'bucket' the values in one of the columns, and then use the 'bucketed' values in the index.

The code below should help explain further:

import numpy as np
import pandas as pd

# No error checking, pseudocode  ...
def bucket_generator(source_data, colname, step_size):
    # create bucket column (string)
    source_data['bucket'] = ''

    # obtain the series to operate on
    series = source_data[colname]

    # determine which bucket number each cell in series would belong to,
    # by dividing the cell value by the step_size

    # Naive way would be to iterate over cells in series, generating a 
    # bucket label like "bucket_{0:+}".format(cell_value/step_size),
    # then stick it in a cell in the bucket column, but there must be a more
    # 'dataframe' way of doing it, rather than looping


data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
        'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
        'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
      }

df = pd.DataFrame(data)

df

     a      b      c
0   10   98.5  12.50
1    3  107.2  23.40
2    5  350.0  11.50
3    7  211.2  45.20
4   15  120.5  17.60
5   20  -70.8  19.50
6   10  135.9   0.25
7    3  205.1  33.60
8    5  -12.8  18.90
9    7  280.5   6.50
10  19  -19.7  12.50
11   5   77.2  26.20
12   7   88.2   5.20
13   5   69.2   0.30
14  10  101.2   7.20
15   5 -302.4   8.90
16   3  -79.8   2.10
17   7 -257.6   3.10
18  20   89.6  19.10
19  20   95.7  20.20

This is what I want to do:

Correctly implement the function bucket_generator.
Group the dataframe data by column 'a', THEN by the 'bucket' label.
Select rows from the dataframe for a given value (integer) in the 'a' column AND a given label in the 'bucket' column.

Answer 1

New Answer

Focusing on what the OP asked for:

def bucket_generator(source_data, colname, step_size):
    series = source_data[colname]
    source_data['bucket'] = 'bucket_' + (series // step_size).astype(int).astype(str)

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
        'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
        'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
      }

df = pd.DataFrame(data)
bucket_generator(df, 'a', 5)

df1 = df.set_index(['a', 'bucket']).sort_index(kind='mergesort')
print(df1.xs((3, 'bucket_0')).reset_index())

dob = {bucket: group for bucket, group in df.groupby(['a', 'bucket'])}
print(dob[(3, 'bucket_0')])

   a    bucket      b     c
0  3  bucket_0  107.2  23.4
1  3  bucket_0  205.1  33.6
2  3  bucket_0  -79.8   2.1
    a      b     c    bucket
1   3  107.2  23.4  bucket_0
7   3  205.1  33.6  bucket_0
16  3  -79.8   2.1  bucket_0
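
The question's pseudocode asked for signed labels in the style of "bucket_{0:+}", which the function above does not produce. The following is a minimal sketch of one way to get them, assuming floor division is the intended bucketing rule. The helper name signed_bucket_labels and the small sample frame are illustrative only and are not part of the original answer.

```python
import pandas as pd

def signed_bucket_labels(series, step_size):
    """Hypothetical helper: label each value with its signed bucket number."""
    # Floor division gives the bucket number; negative values round down
    numbers = (series // step_size).astype(int)
    # '{:+d}' always prints the sign, e.g. +0, +3, -1
    return numbers.map('bucket_{:+d}'.format)

df = pd.DataFrame({'a': [10, 3, 5, 7, 15, 20],
                   'b': [98.5, 107.2, 350.0, 211.2, 120.5, -70.8]})
df['bucket'] = signed_bucket_labels(df['b'], 100)

# get_group picks out a single (a, bucket) combination without building a dict
subset = df.groupby(['a', 'bucket']).get_group((3, 'bucket_+1'))
print(df)
print(subset)
```

Note that with floor division a value such as -70.8 lands in bucket -1 rather than bucket 0; if truncation toward zero is wanted instead, the division step would need to change.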

Old Answer

Assign to df.index a list of arrays, one for each index level you want.
Use pd.qcut to do the bucketing (it bins by quantile, so each bucket holds roughly the same number of rows).
Use a list comprehension to generate the labels.

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

df.index = [
    pd.qcut(df.a, 3, enlabeler('a', 3)),
    pd.qcut(df.b, 3, enlabeler('b', 3)),
    pd.qcut(df.c, 3, enlabeler('c', 3))
]

print(df)
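
Because pd.qcut picks its bin edges from the data, it behaves differently from the fixed step_size bucketing in the question. A small sketch contrasting the two on a shortened version of column 'a' (the sample values and variable names here are illustrative assumptions):

```python
import pandas as pd

a = pd.Series([10, 3, 5, 7, 15, 20, 10, 3, 5, 7], name='a')

# Quantile-based: edges come from the data, so bins hold roughly equal counts
by_quantile = pd.qcut(a, 3, labels=['a_0', 'a_1', 'a_2'])

# Fixed step: edges are multiples of step_size, so counts can differ widely
step_size = 5
by_step = 'bucket_' + (a // step_size).astype(str)

print(by_quantile.value_counts().sort_index())
print(by_step.value_counts().sort_index())
```

Which one fits depends on whether the buckets should mean "same width" or "same number of rows"; the question's step_size parameter suggests the former.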

A little more dynamically, and with a subset of columns:

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

def cutcol(c, n):
    return pd.qcut(c, n, enlabeler(c.name, n))

df.index = df[['a', 'b']].apply(cutcol, n=3).values.T.tolist()
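
Once the bucket labels form the index, rows can be selected by label. The sketch below is a variation on the approach above, not the answer's exact code: it builds the index with pd.MultiIndex.from_arrays so the levels can be named, and the level names a_bucket and b_bucket plus the shortened sample data are assumptions made for illustration.

```python
import pandas as pd

def enlabeler(prefix, n):
    return ['{}_{}'.format(prefix, i) for i in range(n)]

df = pd.DataFrame({'a': [10, 3, 5, 7, 15, 20, 10, 3],
                   'b': [98.5, 107.2, 350.0, 211.2, 120.5, -70.8, 135.9, 205.1]})

# One array of quantile labels per column, converted to plain strings
levels = [pd.qcut(df[col], 3, labels=enlabeler(col, 3)).astype(str)
          for col in ['a', 'b']]
df.index = pd.MultiIndex.from_arrays(levels, names=['a_bucket', 'b_bucket'])
df = df.sort_index()  # a sorted MultiIndex makes label lookups faster

# All rows in the lowest 'a' bucket, whatever their 'b' bucket
a0_rows = df.xs('a_0', level='a_bucket')
print(a0_rows)

# Rows for one full (a_bucket, b_bucket) key, here simply the first one present
first_key = df.index[0]
print(df.loc[[first_key]])
```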

Solution 1
1 accepted 2017-01-14 23:38:28
