Pandas: assign an index to each group identified by groupby

Question

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have

>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
   a  b
0  1  1
1  1  1
2  1  2
3  2  1
4  2  1
5  2  2

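How can I get a DataFrame with an additional idx column that holds the number of each (a, b) group?

(The order of the idx values doesn't matter.)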

Answer 1

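Here is a solution using ngroup (available as of pandas 0.20.2), taken from Constantino's comment above, for those still looking for this function (the equivalent of dplyr::group_indices in R, or egen group() in Stata, if you searched with the same keywords I did). According to my own timings, it is also about 25% faster than the solution given by maxliving.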

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df['idx'] = df.groupby(['a', 'b']).ngroup()
>>> df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3

>>> %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
1.83 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['idx'] = df.groupby(['a', 'b']).ngroup()
1.38 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
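
One detail worth knowing when using ngroup: the numbers follow the order in which groupby iterates over the groups, so by default they follow the sorted group keys rather than the row order. A minimal sketch, assuming a small made-up frame whose rows are not sorted, shows how passing sort=False to groupby numbers the groups by first appearance instead:

```python
import pandas as pd

# Hypothetical data: group (2, 1) appears before group (1, 2).
df = pd.DataFrame({'a': [2, 2, 1, 1], 'b': [1, 1, 2, 2]})

# Default: groups are numbered in sorted key order.
by_key = df.groupby(['a', 'b']).ngroup()

# sort=False: groups are numbered in order of first appearance.
by_appearance = df.groupby(['a', 'b'], sort=False).ngroup()

print(by_key.tolist())         # [1, 1, 0, 0]
print(by_appearance.tolist())  # [0, 0, 1, 1]
```

Since the question says the order of the idx values doesn't matter, either variant should do.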

Answer 2

Here is a concise way to get a unique identifier using drop_duplicates and merge.

group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )

   a  b  index
0  1  1      0
1  1  1      0
2  1  2      2
3  2  1      3
4  2  1      3
5  2  2      5

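In this case the identifiers come out as 0, 2, 3, 5 (just a remnant of the original index), but this can easily be changed to 0, 1, 2, 3 with an additional reset_index(drop=True).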
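
The extra reset_index(drop=True) step could look like the sketch below. The intermediate name keys and the rename of the generated index column to idx are my own choices for illustration, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
group_vars = ['a', 'b']

# One row per distinct (a, b) combination.
keys = df.drop_duplicates(group_vars)

# Drop the leftover original index (0, 2, 3, 5), then turn the fresh
# 0..n-1 index into a column and name it idx.
keys = keys.reset_index(drop=True).reset_index()
keys = keys.rename(columns={'index': 'idx'})

# Attach the identifier to every row of the original frame.
result = df.merge(keys, on=group_vars)
print(result['idx'].tolist())  # [0, 0, 1, 2, 2, 3]
```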

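Update: newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method, as noted in a comment on the question above and in the later answer by @CalumYou. I'll leave this here as an alternative approach, but ngroup seems to be the better way in most cases.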

Answer 3

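A simple way to do this is to concatenate your grouping columns (so that each combination of their values represents one distinct element), then convert the result to a pandas Categorical and keep only its codes: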

df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df

    a   b   idx
0   1   1   0
1   1   1   0
2   1   2   1
3   2   1   2
4   2   1   2
5   2   2   3

Edit: changed the labels attribute to codes, as the former seems to be deprecated.

Edit 2: added a separator, as suggested by Authman Apatira.
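
To illustrate why the separator matters, here is a small sketch with made-up values: without a separator, the pairs (1, 11) and (11, 1) both concatenate to '111' and end up with the same code.

```python
import pandas as pd

# Hypothetical data chosen so that plain concatenation collides.
df = pd.DataFrame({'a': [1, 11], 'b': [11, 1]})

# '1' + '11' and '11' + '1' are both '111': two groups, one code.
no_sep = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes

# '1_11' and '11_1' stay distinct.
with_sep = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes

print(len(set(no_sep)))    # 1
print(len(set(with_sep)))  # 2
```

The separator only helps as long as it cannot occur inside the values themselves, so for string columns it should be chosen with that in mind.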

Answer 4

Definitely not the most straightforward solution, but here is what I would do (comments in the code):

df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})

#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)

print(df)

This generates a unique idx for each combination of a and b.

   a  b idx
0  1  1  11
1  1  1  11
2  1  2  12
3  2  1  21
4  2  1  21
5  2  2  22

But this is still a rather silly index (think of more complex values in columns a and b). So let's clean up the index:

# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))

# switch keys and values, so you can use dict in .replace method
dict_idx = {y: x for x, y in dict_idx.items()}

#replace values with the generated dict
df["idx"] = df["idx"].replace(dict_idx)

print(df)

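This yields the desired output (the exact numbering may differ, because the iteration order of a set is arbitrary):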

   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
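
The two steps above (build a key string per row, then map each distinct key to an integer) can also be done with pandas.factorize. This is a different technique from the dictionary and replace approach in this answer. A minimal sketch, assuming a '_' separator is added to the key (the answer above joins without one):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# Build one key string per row, e.g. '1_1', '1_2', ...
key = df['a'].astype(str) + '_' + df['b'].astype(str)

# factorize numbers the distinct keys in order of first appearance
# and also returns the distinct keys themselves.
codes, uniques = pd.factorize(key)
df['idx'] = codes

print(df['idx'].tolist())  # [0, 0, 1, 2, 2, 3]
```

Because factorize numbers keys by first appearance, the result is reproducible, unlike the numbering derived from a set.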

Answer 5

A method that I believe is an order of magnitude faster than the currently accepted answer (timing results below):

def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought 
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()

Timing results:

import numpy as np
import pandas as pd

a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})

In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop

In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
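
Note that the function above sorts the caller's DataFrame in place and numbers the groups starting from 1. If that side effect is unwanted, a variant along the following lines might work. The function name group_index_from_duplicated and the test data are hypothetical, and the sketch assumes the DataFrame has a unique index:

```python
import pandas as pd

def group_index_from_duplicated(df, grouping_cols):
    # Work on a sorted copy so the caller's frame is left untouched.
    ordered = df.sort_values(grouping_cols)
    # The first row of each group is the one that is not a duplicate.
    new_group = ~ordered.duplicated(subset=grouping_cols, keep='first')
    # Running count of group starts, shifted to begin at 0, then
    # realigned to the original row order (assumes a unique index).
    return new_group.cumsum().sub(1).reindex(df.index)

# Hypothetical unsorted data.
df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': [1, 1, 1, 2]})
df['idx'] = group_index_from_duplicated(df, ['a', 'b'])

print(df['idx'].tolist())  # [2, 0, 2, 1]
```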

Answer 6

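I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns, then checks whether each row differs from the previous row, and if so increments a running count by 1. See further below for an answer that handles string data.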

df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)

Output

0    1
1    1
2    2
3    3
4    3
5    4
dtype: int64

To break this up into steps, let's look at df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.

     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  1.0
3  1.0 -1.0
4  0.0  0.0
5  0.0  1.0

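A new group only needs a single column to differ, so this is what .ne(0).any(axis=1) checks: whether any of the columns is not equal to 0. After that, a cumulative sum keeps track of the groups.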
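
The same chain, written out step by step with intermediate variables (the variable names are mine, chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# Step 1: sort so that rows of the same group are adjacent.
ordered = df.sort_values(['a', 'b'])

# Step 2: difference to the previous row; the first row has no
# predecessor, so its NaN is replaced by 0.
delta = ordered.diff().fillna(0)

# Step 3: a row starts a new group if any column changed.
changed = delta.ne(0).any(axis=1)

# Step 4: running count of group starts, shifted to begin at 1.
group_id = changed.cumsum().add(1)

print(changed.tolist())   # [False, False, True, True, False, True]
print(group_id.tolist())  # [1, 1, 2, 3, 3, 4]
```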

Answer with columns as strings

#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])

Output of df1

   a  b
0  a  a
1  a  a
4  a  a
3  b  a
2  b  b
5  c  c
6  c  d
8  c  d
7  d  d

Take a similar approach by checking whether the group has changed

df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)

0    1
1    1
4    1
3    2
2    3
5    4
6    5
8    5
7    6
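
As a cross-check, the shift-based numbering above can be compared against ngroup from the first answer, which accepts string columns directly. This is only a sanity-check sketch. The variable names are mine, and 1 is added to the ngroup result because ngroup starts counting at 0:

```python
import pandas as pd

df = pd.DataFrame({'a': list('aabbaccdc'), 'b': list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])

# Shift-based approach: a row starts a new group when it differs
# from the previous row in any column.
shift_ids = df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)

# ngroup numbers the sorted group keys from 0, so add 1 to compare.
ngroup_ids = df.groupby(['a', 'b']).ngroup().add(1)

# Restore the original row order before comparing.
print(shift_ids.sort_index().tolist())  # [1, 1, 3, 2, 1, 4, 5, 6, 5]
print(ngroup_ids.tolist())              # [1, 1, 3, 2, 1, 4, 5, 6, 5]
```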

6 answers

Answer 1: 33, 2018-07-13 17:07:41
Answer 2: 17, accepted, 2017-01-13 15:45:41
Answer 3: 15, 2017-01-11 16:07:27
Answer 4: 2, 2017-01-11 16:05:23
Answer 5: 2, 2017-09-13 19:46:01
Answer 6: 1, 2017-01-11 15:49:53