
Groupby value counts on the dataframe pandas

I have the following dataframe:

import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

I want to group it by id and group and calculate the number of each term for each (id, group) pair.

So in the end I am going to get something like this:

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand, and there are ~10 of them.)
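For reference, a naive loop along those lines might look like the following sketch (the asker's actual code isn't shown, so the counting logic here is assumed):

# Assumed sketch of the slow iterrows approach, for illustration only:
# accumulate counts in a dict keyed by (id, group), then build a frame.
counts = {}
for _, row in df.iterrows():
    key = (row['id'], row['group'])
    counts.setdefault(key, {})
    counts[key][row['term']] = counts[key].get(row['term'], 0) + 1

result = pd.DataFrame.from_dict(counts, orient='index').fillna(0).astype(int)
result.index = pd.MultiIndex.from_tuples(result.index, names=['id', 'group'])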

It looks like I have to group by and then count values, so I tried df.groupby(['id', 'group']).value_counts(), which does not work because value_counts operates on the groupby series and not a dataframe.

Is there any way I can achieve this without looping?

I use groupby and size:

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
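Since the full list of terms is known beforehand, the result can also be reindexed so every term gets a column even when some (id, group) pair never uses it (the all_terms list below is hypothetical):

# Hypothetical list of the ~10 known terms; reindex guarantees a column for each.
all_terms = ['term1', 'term2', 'term3']
df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0).reindex(columns=all_terms, fill_value=0)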


Timing

[timing plot comparing the solutions on the sample dataframe]

1,000,000 rows

import numpy as np

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))

[timing plot for the 1,000,000-row dataframe]
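A minimal sketch to reproduce this comparison with the standard timeit module (absolute numbers will vary by machine and pandas version):

import timeit

statements = {
    'groupby/size': "df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)",
    'pivot_table': "df.pivot_table(index=['id', 'group'], columns='term', aggfunc='size', fill_value=0)",
    'crosstab': 'pd.crosstab([df.id, df.group], df.term)',
}
for label, stmt in statements.items():
    # Average over 3 runs of each statement.
    seconds = timeit.timeit(stmt, globals=globals(), number=3) / 3
    print(f'{label}: {seconds:.3f} s per loop')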

Using the pivot_table() method:

In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
Out[22]:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Timing against a 700K-row DF:

In [24]: df = pd.concat([df] * 10**5, ignore_index=True)

In [25]: df.shape
Out[25]: (700000, 3)

In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop

In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop

In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop

In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop

In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop

Timing against a 7M-row DF:

In [9]: df = pd.concat([df] * 10, ignore_index=True)

In [10]: df.shape
Out[10]: (7000000, 3)

In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop

In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop

In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop

In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop

In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop

Rather than memorizing lengthy solutions, think of the one pandas has built in for you: since pandas 1.4, value_counts works directly on a DataFrame groupby:

df.groupby(['id', 'group']).value_counts()
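To get the wide table shown in the other answers, unstack the term level of the resulting series:

df.groupby(['id', 'group']).value_counts().unstack(fill_value=0)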

You can use crosstab:

print(pd.crosstab([df.id, df.group], df.term))
term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
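As a side note, crosstab can also append row and column totals, should they be useful here:

# margins=True adds an 'All' row and column with the totals.
pd.crosstab([df.id, df.group], df.term, margins=True)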

Another solution is groupby with size aggregation, reshaping with unstack:

df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)

term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Timings:

df = pd.concat([df]*10000).reset_index(drop=True)

In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop

In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop

If you want to use value_counts, you can use it on a given series, as follows:

df.groupby(["id", "group"])["term"].value_counts().unstack(fill_value=0)

or, in an equivalent fashion, using the .agg method:

df.groupby(["id", "group"]).agg({"term": "value_counts"}).unstack(fill_value=0)

Another option is to use value_counts directly on the DataFrame itself, without resorting to groupby (DataFrame.value_counts was added in pandas 1.1):

df.value_counts().unstack(fill_value=0)
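This works here because the frame's columns are exactly the two grouping keys plus term; if the frame had extra columns, the count could be restricted with the subset parameter:

# Count only the relevant column combination, ignoring any other columns.
df.value_counts(subset=['id', 'group', 'term']).unstack(fill_value=0)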

Another alternative:

df.assign(count=1).groupby(['id', 'group', 'term']).sum().unstack(fill_value=0).xs("count", axis=1)

term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
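A slightly leaner variant of the same idea selects the helper column before reshaping, which avoids the cross-section at the end:

# Sum the helper column per (id, group, term), then pivot term into columns.
df.assign(count=1).groupby(['id', 'group', 'term'])['count'].sum().unstack(fill_value=0)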
