Grouping data in a dataframe from multiple columns (including an autogenerated column)
I am trying to create an index from two columns in a pandas dataframe. However, I first want to 'bucket' the values in one of the columns and then use the bucketed values in the index.
The code below should help explain further:
import numpy as np
import pandas as pd

# No error checking, pseudocode ...
def bucket_generator(source_data, colname, step_size):
    # create bucket column (string)
    source_data['bucket'] = ''
    # obtain the series to operate on (colname is a variable, not a literal)
    series = source_data[colname]
    # determine which bucket number each cell in series would belong to,
    # by dividing the cell value by the step_size
    # Naive way would be to iterate over cells in series, generating a
    # bucket label like "bucket_{0:+}".format(cell_value/step_size),
    # then stick it in a cell in the bucket column, but there must be a more
    # 'dataframe' way of doing it, rather than looping

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
        'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
        'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
}
df = pd.DataFrame(data)
df
     a      b      c
0   10   98.5  12.50
1    3  107.2  23.40
2    5  350.0  11.50
3    7  211.2  45.20
4   15  120.5  17.60
5   20  -70.8  19.50
6   10  135.9   0.25
7    3  205.1  33.60
8    5  -12.8  18.90
9    7  280.5   6.50
10  19  -19.7  12.50
11   5   77.2  26.20
12   7   88.2   5.20
13   5   69.2   0.30
14  10  101.2   7.20
15   5 -302.4   8.90
16   3  -79.8   2.10
17   7 -257.6   3.10
18  20   89.6  19.10
19  20   95.7  20.20
This is what I want to do:
Correctly implement the function bucket_generator.
Focusing on what the OP asked for:
def bucket_generator(source_data, colname, step_size):
    series = source_data[colname]
    source_data['bucket'] = 'bucket_' + (series // step_size).astype(int).astype(str)

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
}
df = pd.DataFrame(data)
bucket_generator(df, 'a', 5)
df1 = df.set_index(['a', 'bucket']).sort_index(kind='mergesort')
print(df1.xs((3, 'bucket_0')).reset_index())
dob = {bucket: group for bucket, group in df.groupby(['a', 'bucket'])}
print(dob[(3, 'bucket_0')])
   a    bucket      b     c
0  3  bucket_0  107.2  23.4
1  3  bucket_0  205.1  33.6
2  3  bucket_0  -79.8   2.1
    a      b     c    bucket
1   3  107.2  23.4  bucket_0
7   3  205.1  33.6  bucket_0
16  3  -79.8   2.1  bucket_0
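The question's pseudocode mentions labels of the form "bucket_{0:+}", that is, with an explicit sign. That matters for a column such as 'b', which holds negative values. A minimal sketch of that variant follows, assuming floor division is the intended bucketing rule. The helper name signed_bucket_labels and the step size of 100 are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def signed_bucket_labels(series, step_size):
    """Return labels such as 'bucket_+3' or 'bucket_-1' for each value.

    Hypothetical helper, not part of the original answer.
    """
    # Floor division gives the bucket number; negative values round
    # towards minus infinity, so -12.8 // 100 is -1.0.
    numbers = (series // step_size).astype(int)
    # Format every bucket number with an explicit sign.
    return numbers.map('bucket_{:+d}'.format)

b = pd.Series([98.5, -12.8, 350.0, -302.4], name='b')
labels = signed_bucket_labels(b, 100)
print(labels.tolist())
```

Because floor division rounds down, values just below zero land in bucket -1 rather than sharing bucket 0 with small positive values.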
Assign to the index of df a list of the levels you want as index levels. Use pd.qcut to help with the bucketizing:
def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

df.index = [
    pd.qcut(df.a, 3, enlabeler('a', 3)),
    pd.qcut(df.b, 3, enlabeler('b', 3)),
    pd.qcut(df.c, 3, enlabeler('c', 3))
]
print(df)
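Once the index is built this way, rows can be pulled out by bucket label. The sketch below is only an illustration: it uses the first six rows of the question's data and two index levels instead of three, and the variable names low_a and high_b are hypothetical.

```python
import pandas as pd

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

# First six rows of the question's data, columns 'a' and 'b' only.
df = pd.DataFrame({
    'a': [10, 3, 5, 7, 15, 20],
    'b': [98.5, 107.2, 350.0, 211.2, 120.5, -70.8],
})

# Two-level index made of quantile bucket labels.
df.index = [
    pd.qcut(df.a, 3, labels=enlabeler('a', 3)),
    pd.qcut(df.b, 3, labels=enlabeler('b', 3)),
]

# Rows whose 'a' value falls in the lowest third.
low_a = df.xs('a_0', level=0)
# Rows whose 'b' value falls in the highest third.
high_b = df.xs('b_2', level=1)
print(low_a)
print(high_b)
```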
A little more dynamically and with a subset of columns:
def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

def cutcol(c, n):
    return pd.qcut(c, n, enlabeler(c.name, n))

df.index = df[['a', 'b']].apply(cutcol, n=3).values.T.tolist()
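Note that pd.qcut chooses bucket edges from quantiles, so each bucket holds roughly the same number of rows but the buckets differ in width. If fixed-width buckets of step_size are wanted, as in the question, pd.cut with explicit edges is one option. A rough sketch, where fixed_width_buckets is a hypothetical helper and not something from the original answer:

```python
import numpy as np
import pandas as pd

def fixed_width_buckets(series, step_size):
    """Cut a numeric series into left-closed intervals of width step_size.

    Hypothetical helper for illustration only.
    """
    # Whole-number multiples of step_size that enclose the data.
    first = int(np.floor(series.min() / step_size))
    last = int(np.floor(series.max() / step_size)) + 1
    edges = np.arange(first, last + 1) * step_size
    # right=False makes each interval [k*step, (k+1)*step), which matches
    # the floor division used in bucket_generator.
    return pd.cut(series, bins=edges, right=False)

a = pd.Series([10, 3, 5, 7, 15, 20], name='a')
buckets = fixed_width_buckets(a, 5)
print(buckets)
```

The result is a categorical series of intervals, so it can be used as an index level or a groupby key in the same way as the pd.qcut output.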