pandas.DataFrame: how to reindex by group?

Question

Can a new index be applied to a DataFrame according to the grouping made with groupby? More precisely: is there an elegant way to do that, and can the original DataFrame be changed through groupby groups at all?

UPD: My data looks like this:

   A  B         C
0  a  x  0.903343
1  a  z  0.982050
2  g  x  0.274823
3  g  y  0.334491
4  c  z  0.756728
5  f  z  0.697841
6  d  z  0.505845
7  b  z  0.768199
8  b  y  0.743012
9  e  x  0.697212

I am grouping by columns 'A' and 'B', and I want every unique pair of values in those columns to have the same index value in the original DataFrame. Also, the original DataFrame can be big, and I'm trying to figure out how to do such a reindex without inefficiently building a whole new DataFrame.

Currently I'm using this solution:

import random
from string import ascii_lowercase

import pandas as pd

df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                    'B': [random.choice(['x', 'y']) for _ in range(10)],
                    'C': [random.random() for _ in range(10)]})

# Tag every (A, B) group with a running id, then stitch the groups back together.
pieces = []
for i, (n, g) in enumerate(df.groupby(['A', 'B'])):
    g = g.copy()
    g['id'] = i
    pieces.append(g)
new_df = pd.concat(pieces)

new_df.set_index('id', inplace=True)

Answer 1

You can do this quickly with some internal functions in pandas:

Create a test DataFrame first:

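import random
from string import ascii_lowercase

import pandas as pd

# The outputs shown below are from the original post; exact values may differ
# across Python versions even with the same seed.
random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                    'B': [random.choice(['x', 'y']) for _ in range(10)],
                    'C': [random.random() for _ in range(10)]})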

If you want the new ids to follow the sort order of columns A and B:

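# Written for the pandas of the time: MultiIndex.labels is now MultiIndex.codes,
# and the private pd.lib module is no longer exposed.
m = pd.MultiIndex.from_arrays((df.A, df.B))
# Zip the per-level integer labels into tuples; sort=True orders the ids by (A, B).
df.index = pd.factorize(pd.lib.fast_zip(m.labels), sort=True)[0]
print(df)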

The output is:

   A  B         C
1  a  y  0.025446
7  e  x  0.541412
6  d  y  0.939149
2  b  x  0.381204
3  c  x  0.216599
4  c  y  0.422117
5  d  x  0.029041
6  d  y  0.221692
1  a  y  0.437888
0  a  x  0.495812
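
On current pandas, where `pd.lib` and `MultiIndex.labels` are no longer available, the same sorted group ids can probably be obtained with the public `GroupBy.ngroup()` method. The following is only a sketch of such an equivalent and not part of the original answer; the frame is typed in by hand from the output above rather than regenerated from the random seed.

```python
import pandas as pd

# Hand-typed copy of the A/B/C values shown in the output above
# (an assumption made so the example does not depend on the random seed).
df = pd.DataFrame({
    'A': list('aedbccddaa'),
    'B': list('yxyxxyxyyx'),
    'C': [0.025446, 0.541412, 0.939149, 0.381204, 0.216599,
          0.422117, 0.029041, 0.221692, 0.437888, 0.495812],
})

# ngroup() numbers each (A, B) group; with the default sort=True the
# numbers follow the sorted order of the group keys.
df.index = df.groupby(['A', 'B']).ngroup().to_numpy()
print(df)
```

This assigns the index in place on the existing frame, so no per-group copies are appended together.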

If you don't care about the order of the new ids:

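m = pd.MultiIndex.from_arrays((df.A, df.B))
la, lb = m.labels  # integer labels per level (MultiIndex.codes in current pandas)
# Combine both labels into one integer key; len(lb) exceeds every value in lb,
# so each (A, B) pair maps to a distinct key.
df.index = pd.factorize(la*len(lb)+lb)[0]
print(df)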

The output is:

   A  B         C
0  a  y  0.025446
1  e  x  0.541412
2  d  y  0.939149
3  b  x  0.381204
4  c  x  0.216599
5  c  y  0.422117
6  d  x  0.029041
2  d  y  0.221692
0  a  y  0.437888
7  a  x  0.495812
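
The key-combining idea above can also be sketched with public functions only. This is an assumed present-day equivalent, not the answerer's code: it factorizes each column with `pd.factorize`, merges the two code arrays into a single integer key, and factorizes that key again. The A and B columns are typed in by hand from the output above.

```python
import pandas as pd

# Hand-typed A/B columns from the output above (assumption for illustration).
df = pd.DataFrame({
    'A': list('aedbccddaa'),
    'B': list('yxyxxyxyyx'),
})

# Integer codes per column, numbered in order of first appearance.
la, ua = pd.factorize(df['A'])
lb, ub = pd.factorize(df['B'])

# Combine the two codes into one key; multiplying by the number of
# distinct B values keeps every (A, B) pair unique.
key = la * len(ub) + lb

# factorize() without sort numbers the keys in order of first appearance.
df.index = pd.factorize(key)[0]
print(df)
```

A shorter route that should give the same numbering is `df.groupby(['A', 'B'], sort=False).ngroup()`, since with `sort=False` the groups are enumerated in the order they are first seen.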

Answer 1
0 2013-03-13 01:42:55
