pandas.DataFrame - how to reindex by group?
Can a new index be applied to a DataFrame according to the grouping made with groupby? More precisely: is there an elegant way to do that, and can the original DataFrame be changed through groupby groups at all?
UPD: My data looks like this:
A B C
0 a x 0.903343
1 a z 0.982050
2 g x 0.274823
3 g y 0.334491
4 c z 0.756728
5 f z 0.697841
6 d z 0.505845
7 b z 0.768199
8 b y 0.743012
9 e x 0.697212
I'm grouping by columns 'A' and 'B', and I want every unique pair of values in those columns to get the same index value in the original DataFrame. Also, the original DataFrame can be big, and I'm trying to figure out how to do such a reindex without inefficiently building a whole new DataFrame.
Currently I'm using this solution:
import random
from string import ascii_lowercase

import pandas as pd

df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                   'B': [random.choice(['x', 'y']) for _ in range(10)],
                   'C': [random.random() for _ in range(10)]})
groups = []
for i, (n, g) in enumerate(df.groupby(['A', 'B'])):
    g = g.copy()
    g['id'] = i  # every row of the group gets the same id
    groups.append(g)
# concatenate all groups once instead of appending inside the loop
new_df = pd.concat(groups).set_index('id')
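On the question of whether the original DataFrame can be changed through the groups: each group keeps the row labels of the original frame, so its index can be used to write back. The following is only a minimal sketch of that idea on a small made-up frame (the data and the name `ids` are illustrative assumptions, not taken from the question); it still loops over the groups in Python, so it is not meant as the fast option.

```python
import pandas as pd

# Small hypothetical frame, used only for illustration.
df = pd.DataFrame({'A': list('aabba'),
                   'B': list('xyxxx'),
                   'C': [0.1, 0.2, 0.3, 0.4, 0.5]})

# One id per row, aligned with the original row labels.
ids = pd.Series(0, index=df.index)
for i, (key, g) in enumerate(df.groupby(['A', 'B'])):
    # g.index holds the labels of the group's rows in the original frame
    ids.loc[g.index] = i

# Replace the index of the original frame; no second DataFrame is built.
df.index = ids.to_numpy()
print(df)
```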
You can do this quickly with some internal functions in pandas:
Create a test DataFrame first:
import random
from string import ascii_lowercase

import pandas as pd

random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                   'B': [random.choice(['x', 'y']) for _ in range(10)],
                   'C': [random.random() for _ in range(10)]})
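If you want the new ids to follow the sort order of columns A and B:
# m.labels holds the integer codes of each level
# (the attribute is called m.codes from pandas 0.24 on)
m = pd.MultiIndex.from_arrays((df.A, df.B))
# pd.lib.fast_zip is internal; its location may differ between pandas versions
df.index = pd.factorize(pd.lib.fast_zip(m.labels), sort=True)[0]
print(df)
The output is: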
A B C
1 a y 0.025446
7 e x 0.541412
6 d y 0.939149
2 b x 0.381204
3 c x 0.216599
4 c y 0.422117
5 d x 0.029041
6 d y 0.221692
1 a y 0.437888
0 a x 0.495812
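The same sorted ids can probably also be obtained without internal functions. The sketch below assumes a pandas version that provides `GroupBy.ngroup` (which the code above does not rely on) and rebuilds the frame from the values printed in the output above, so it does not depend on the random seed.

```python
import pandas as pd

# Values copied from the output shown above.
df = pd.DataFrame({
    'A': list('aedbccddaa'),
    'B': list('yxyxxyxyyx'),
    'C': [0.025446, 0.541412, 0.939149, 0.381204, 0.216599,
          0.422117, 0.029041, 0.221692, 0.437888, 0.495812],
})

# ngroup() numbers the groups in the order groupby iterates over them,
# which is sorted key order with the default sort=True.
group_ids = df.groupby(['A', 'B']).ngroup()

# Assign the ids as the index of the original frame.
df.index = group_ids.to_numpy()
print(df.index.tolist())  # [1, 7, 6, 2, 3, 4, 5, 6, 1, 0]
```

Assigning to `df.index` replaces the index of the existing frame, so no second DataFrame has to be assembled group by group.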
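If you don't care about the order of the new ids:
m = pd.MultiIndex.from_arrays((df.A, df.B))
la, lb = m.labels  # integer codes of A and B (m.codes from pandas 0.24 on)
# la*len(lb)+lb is a distinct integer for every (A, B) pair
df.index = pd.factorize(la*len(lb)+lb)[0]
print(df)
The output is: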
A B C
0 a y 0.025446
1 e x 0.541412
2 d y 0.939149
3 b x 0.381204
4 c x 0.216599
5 c y 0.422117
6 d x 0.029041
2 d y 0.221692
0 a y 0.437888
7 a x 0.495812
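For ids in order of first appearance, a public-API sketch is to pass `sort=False` to `groupby` before calling `ngroup`. As before, this assumes a pandas version that has `GroupBy.ngroup`, and the frame is rebuilt from the printed output rather than from the random seed.

```python
import pandas as pd

# Values copied from the output shown above.
df = pd.DataFrame({
    'A': list('aedbccddaa'),
    'B': list('yxyxxyxyyx'),
    'C': [0.025446, 0.541412, 0.939149, 0.381204, 0.216599,
          0.422117, 0.029041, 0.221692, 0.437888, 0.495812],
})

# With sort=False the groups are numbered in the order they first occur.
group_ids = df.groupby(['A', 'B'], sort=False).ngroup()

df.index = group_ids.to_numpy()
print(df.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 2, 0, 7]
```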