
Grouping data in a dataframe from multiple columns (including an autogenerated column)

I am trying to create an index from two columns in a pandas dataframe. However, I first want to 'bucket' the values in one of the columns, and then use the bucketed values (rather than the raw ones) in the index.

The code below should help explain further:

import numpy as np
import pandas as pd

# No error checking, pseudocode  ...
def bucket_generator(source_data, colname, step_size):
    # create bucket column (string)
    source_data['bucket'] = ''

    # obtain the series to operate on
    series = source_data[colname]

    # determine which bucket number each cell in series would belong to,
    # by dividing the cell value by the step_size

    # Naive way would be to iterate over cells in series, generating a 
    # bucket label like "bucket_{0:+}".format(cell_value/step_size),
    # then stick it in a cell in the bucket column, but there must be a more
    # 'dataframe' way of doing it, rather than looping

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
        'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
        'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
      }

df = pd.DataFrame(data)

df

     a      b      c
0   10   98.5  12.50
1    3  107.2  23.40
2    5  350.0  11.50
3    7  211.2  45.20
4   15  120.5  17.60
5   20  -70.8  19.50
6   10  135.9   0.25
7    3  205.1  33.60
8    5  -12.8  18.90
9    7  280.5   6.50
10  19  -19.7  12.50
11   5   77.2  26.20
12   7   88.2   5.20
13   5   69.2   0.30
14  10  101.2   7.20
15   5 -302.4   8.90
16   3  -79.8   2.10
17   7 -257.6   3.10
18  20   89.6  19.10
19  20   95.7  20.20

This is what I want to do:

  1. Correctly implement the function bucket_generator.
  2. Group the dataframe data by column 'a', THEN by the 'bucket' label.
  3. Select rows from the dataframe for a given value (integer) in the 'a' column AND a given bucket label in the 'bucket' column.

New Answer

Focusing on what OP asked for:

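import pandas as pd

def bucket_generator(source_data, colname, step_size):
    # Floor-divide every value by step_size in one vectorized step (no loop),
    # then turn the bucket number into a string label. Modifies source_data in place.
    series = source_data[colname]
    source_data['bucket'] = 'bucket_' + (series // step_size).astype(int).astype(str)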

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
        'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
        'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
      }

df = pd.DataFrame(data)
bucket_generator(df, 'a', 5)

df1 = df.set_index(['a', 'bucket']).sort_index(kind='mergesort')
print(df1.xs((3, 'bucket_0')).reset_index())

dob = {bucket: group for bucket, group in df.groupby(['a', 'bucket'])}
print(dob[(3, 'bucket_0')])

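Output of the first print (df1.xs on the indexed frame):

   a    bucket      b     c
0  3  bucket_0  107.2  23.4
1  3  bucket_0  205.1  33.6
2  3  bucket_0  -79.8   2.1

Output of the second print (lookup in the dob dictionary of groups):

    a      b     c    bucket
1   3  107.2  23.4  bucket_0
7   3  205.1  33.6  bucket_0
16  3  -79.8   2.1  bucket_0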
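
The question's pseudocode mentions a signed label format ("bucket_{0:+}") and column 'b' contains negative values, so here is a minimal sketch of how the same floor-division idea might be applied there. The helper name add_bucket, its label parameter, and the small sample frame are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

def add_bucket(frame, colname, step_size, label="bucket"):
    """Hypothetical helper: return a copy of frame with a signed bucket label per row."""
    out = frame.copy()
    # Floor division rounds toward negative infinity, so -12.8 // 100 gives -1
    # and negative values land in their own buckets instead of sharing bucket 0.
    numbers = (out[colname] // step_size).astype(int)
    # '{:+d}' keeps an explicit sign, as in the question's "bucket_{0:+}" idea.
    out[label] = numbers.map("bucket_{:+d}".format)
    return out

# Small sample frame, made up for this sketch.
df = pd.DataFrame({
    "a": [3, 3, 5, 5, 3],
    "b": [107.2, 205.1, -12.8, 350.0, -79.8],
})

bucketed = add_bucket(df, "b", 100)

# Group by 'a' THEN 'bucket', and pull out one (a, bucket) combination.
groups = bucketed.groupby(["a", "bucket"])
selected = groups.get_group((3, "bucket_+1"))
print(selected)
```

Returning a copy instead of mutating the input is a design choice of this sketch only; the answer's bucket_generator writes the 'bucket' column into the frame it is given.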

Old Answer

  • Assign to df.index a list of the arrays you want as index levels.
  • Use pd.qcut to help with the bucketizing.
  • Use a list comprehension to help with the labeling.

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

df.index = [
    pd.qcut(df.a, 3, enlabeler('a', 3)),
    pd.qcut(df.b, 3, enlabeler('b', 3)),
    pd.qcut(df.c, 3, enlabeler('c', 3))
]

print(df)
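
Note that pd.qcut picks its bucket edges from the quantiles of the data, so the buckets hold roughly equal numbers of rows, whereas the question describes a fixed step_size. As a rough sketch of the difference, pd.cut with explicit edges gives fixed-width buckets. The sample series, the step of 5 and the variable names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample values, made up for this sketch.
a = pd.Series([10, 3, 5, 7, 15, 20, 10, 3, 5, 7], name="a")

# Quantile-based: edges come from the data, so bucket sizes are roughly equal.
by_quantile = pd.qcut(a, 3, labels=["a_0", "a_1", "a_2"])

# Fixed-width: one edge every `step` units; right=False makes the buckets
# left-closed, so the value 5 falls in [5, 10) rather than (0, 5].
step = 5
edges = np.arange(0, a.max() + step + 1, step)
by_width = pd.cut(a, bins=edges, right=False)

print(pd.DataFrame({"a": a, "qcut": by_quantile, "cut": by_width}))
```

Which of the two fits depends on whether equal-sized groups or equal-width ranges matter more for the index.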


A little more dynamically, and with a subset of columns:

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

def cutcol(c, n):
    return pd.qcut(c, n, enlabeler(c.name, n))

df.index = df[['a', 'b']].apply(cutcol, n=3).values.T.tolist()
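
Once the bucket labels are in the index, selecting rows for a given combination might look like the sketch below. It builds the MultiIndex explicitly with pd.MultiIndex.from_arrays rather than via apply; the helper quantile_labels, the level names a_bucket and b_bucket, the two-bucket split and the sample data are all assumptions for illustration.

```python
import pandas as pd

# Sample frame, made up for this sketch.
df = pd.DataFrame({
    "a": [10, 3, 5, 7, 15, 20],
    "b": [98.5, 107.2, 350.0, 211.2, 120.5, -70.8],
    "c": [12.5, 23.4, 11.5, 45.2, 17.6, 19.5],
})

def quantile_labels(col, n):
    """Hypothetical helper: label each value with '<column>_<quantile bucket>'."""
    labels = ["{}_{}".format(col.name, i) for i in range(n)]
    return pd.qcut(col, n, labels=labels).astype(str)

# One array of labels per index level.
levels = [quantile_labels(df[name], 2) for name in ["a", "b"]]

indexed = df.copy()
indexed.index = pd.MultiIndex.from_arrays(levels, names=["a_bucket", "b_bucket"])
# Sorting the index first keeps label-based lookups efficient.
indexed = indexed.sort_index()

# All rows in the lowest 'a' bucket, whatever their 'b' bucket.
low_a = indexed.xs("a_0", level="a_bucket")

# Rows in one specific (a bucket, b bucket) combination.
one_cell = indexed.loc[("a_0", "b_1"), :]
print(one_cell)
```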

