How to summarize a pandas DataFrame by group with value counts of multiple columns?

Question

If this is a dupe, please point me to it. I checked a few questions that came close, but they don't solve my issue.

I have a dummy DataFrame as follows:

   grp  Ax  Bx  Ay  By  A_match  B_match
0  foo   3   2   2   2    False     True
1  foo   2   1   1   0    False    False
2  foo   4   3   0   3    False     True
3  foo   4   3   1   4    False    False
4  foo   4   4   3   0    False    False
5  bar   3   0   3   0     True     True
6  bar   3   4   0   3    False    False
7  bar   1   2   1   2     True     True
8  bar   1   3   4   1    False    False
9  bar   1   1   0   3    False    False

My goal is to compare the A and B columns and summarize the result by grp, like this:

           A_match       B_match      
           False  True   False True 
grp                                 
bar            3     2       3     2
foo            5     0       3     2

So I added the two _match columns as follows, to get the df shown above:

df['A_match'] = df['Ax'].eq(df['Ay'])
df['B_match'] = df['Bx'].eq(df['By'])

Based on my understanding, I was hoping I could do something like this, but it doesn't work:

df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)

# trunc'd Traceback:
# ... ValueError: no results ...
# ... During handling of the above exception, another exception occurred: ...
# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)

In my actual data, I was able to sidestep this by forcing the _match columns to be pd.Categorical, in a rather unsatisfactory manner. However, that has worked only intermittently, and with this dummy data I get exactly the same error as above even when using pd.Categorical:

df['A_match'] = pd.Categorical(df['Ax'].eq(df['Ay']).values, categories=[True, False])
df['B_match'] = pd.Categorical(df['Bx'].eq(df['By']).values, categories=[True, False])
df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)

# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)

It makes no sense to me: where is shape (5, 7) even coming from? Each agg call would have been passed a shape (5,) object, last I checked. And even agg seems to run differently than I imagined; it should be run against each Series:

>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: type(x))
                                 A_match                              B_match
grp                                                                          
bar  <class 'pandas.core.series.Series'>  <class 'pandas.core.series.Series'>
foo  <class 'pandas.core.series.Series'>  <class 'pandas.core.series.Series'>

# Good - it's Series, I should be able to call value_counts directly?

>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: x.value_counts())

# AttributeError: 'DataFrame' object has no attribute 'value_counts'  <-- ?!?!? Where did 'DataFrame' come from?

I was eventually able to use the following combination, but it is still rather unsatisfactory, as it introduces a lot of unnecessary axis names.

>>> df.melt(id_vars='grp', value_vars=['A_match', 'B_match']).reset_index().pivot_table(index='grp', columns=['variable', 'value'], aggfunc=pd.Series.count)
           index                    
variable A_match       B_match      
value      False True    False True 
grp                                 
bar            3     2       3     2
foo            5     0       3     2

Both methods seem rather contrived for something that should be a relatively common use case. I guess my question is: am I overlooking something obvious here?

Answer 1

You can pass a dictionary to agg:

(df.groupby('grp').agg({'A_match':'value_counts',
                      'B_match':'value_counts'})
   .unstack(-1, fill_value=0)
)

Output:

      A_match       B_match      
      False  True   False  True 
bar     3.0   2.0       3     2
foo     5.0   NaN       3     2
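
If integer counts with zeros instead of NaN are wanted, one possible alternative (a sketch, not part of the answer above) is to melt the match columns, count with groupby(...).size(), and unstack. The frame below is rebuilt from the table in the question; the names match_cols, long_df, wanted and summary are made up for illustration.

```python
import pandas as pd

# Dummy frame reconstructed from the table in the question.
df = pd.DataFrame({
    "grp": ["foo"] * 5 + ["bar"] * 5,
    "Ax": [3, 2, 4, 4, 4, 3, 3, 1, 1, 1],
    "Bx": [2, 1, 3, 3, 4, 0, 4, 2, 3, 1],
    "Ay": [2, 1, 0, 1, 3, 3, 0, 1, 4, 0],
    "By": [2, 0, 3, 4, 0, 0, 3, 2, 1, 3],
})
df["A_match"] = df["Ax"].eq(df["Ay"])
df["B_match"] = df["Bx"].eq(df["By"])

match_cols = ["A_match", "B_match"]  # hypothetical helper name

# Long format: one row per (grp, variable, value) observation.
long_df = df.melt(id_vars="grp", value_vars=match_cols)

# Count rows per combination, then move variable/value into the columns.
# fill_value=0 covers combinations missing for a given group.
summary = (
    long_df.groupby(["grp", "variable", "value"])
    .size()
    .unstack(["variable", "value"], fill_value=0)
)

# Reindex so every (column, False/True) pair is present and in a fixed
# order, even if a pair never occurs anywhere in the data.
wanted = pd.MultiIndex.from_product([match_cols, [False, True]])
summary = summary.reindex(columns=wanted, fill_value=0)

print(summary)
```

Because the counts come from size() and the gaps are filled at the unstack and reindex steps, the result stays integer-typed, and reindexing against an unnamed MultiIndex drops the extra variable/value axis names.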
