How to summarize a pandas DataFrame by group with value counts of multiple columns?
If this is a dupe please guide the way. I checked a few questions that came close but don't solve my issue.
I have a dummy DataFrame as follows:
grp Ax Bx Ay By A_match B_match
0 foo 3 2 2 2 False True
1 foo 2 1 1 0 False False
2 foo 4 3 0 3 False True
3 foo 4 3 1 4 False False
4 foo 4 4 3 0 False False
5 bar 3 0 3 0 True True
6 bar 3 4 0 3 False False
7 bar 1 2 1 2 True True
8 bar 1 3 4 1 False False
9 bar 1 1 0 3 False False
My goal is to compare the A columns (Ax vs. Ay) and the B columns (Bx vs. By) and summarize the result by grp, thus:
A_match B_match
False True False True
grp
bar 3 2 3 2
foo 5 0 3 2
So I added the two _match columns as follows, to get the above df:
df['A_match'] = df['Ax'].eq(df['Ay'])
df['B_match'] = df['Bx'].eq(df['By'])
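For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the dummy frame. The values are copied from the table above; the construction itself is only one assumed way to set it up, not necessarily how the original data was created.

```python
import pandas as pd

# Values copied from the dummy table in the question (rows 0-4 are foo, 5-9 are bar).
df = pd.DataFrame({
    "grp": ["foo"] * 5 + ["bar"] * 5,
    "Ax": [3, 2, 4, 4, 4, 3, 3, 1, 1, 1],
    "Bx": [2, 1, 3, 3, 4, 0, 4, 2, 3, 1],
    "Ay": [2, 1, 0, 1, 3, 3, 0, 1, 4, 0],
    "By": [2, 0, 3, 4, 0, 0, 3, 2, 1, 3],
})

# Element-wise comparison of each x/y pair, as in the question.
df["A_match"] = df["Ax"].eq(df["Ay"])
df["B_match"] = df["Bx"].eq(df["By"])

print(df)
```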
Based on my understanding, I was hoping I could do something like this, but it doesn't work:
df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)
# trunc'd Traceback:
# ... ValueError: no results ...
# ... During handling of the above exception, another exception occurred: ...
# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)
In my actual data, I was able to sidestep this by forcing the _match columns to be pd.Categorical, in a rather unsatisfactory manner. However, I've had on-and-off success with that, and even with this dummy data I'm getting the exact same error as above even when using pd.Categorical:
df['A_match'] = pd.Categorical(df['Ax'].eq(df['Ay']).values, categories=[True, False])
df['B_match'] = pd.Categorical(df['Bx'].eq(df['By']).values, categories=[True, False])
df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)
# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)
It makes no sense to me - where is shape (5, 7) even coming from? Each agg would have been passed a shape (5,), last I checked. And even the agg seems to be running differently than I imagined; it should be run against the Series:
>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: type(x))
A_match B_match
grp
bar <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
foo <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
# Good - it's Series, I should be able to call value_counts directly?
>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: x.value_counts())
# AttributeError: 'DataFrame' object has no attribute 'value_counts' <-- ?!?!? Where did 'DataFrame' come from?
I was eventually able to use the following combination, but it is still rather unsatisfactory as it introduces a lot of unnecessary axis names.
>>> df.melt(id_vars='grp', value_vars=['A_match', 'B_match']).reset_index().pivot_table(index='grp', columns=['variable', 'value'], aggfunc=pd.Series.count)
index
variable A_match B_match
value False True False True
grp
bar 3 2 3 2
foo 5 0 3 2
Either method just seems rather contrived for something that should be a relatively common use case. I guess my question is, am I overlooking something obvious here?
You can agg with a dictionary:
(df.groupby('grp').agg({'A_match': 'value_counts',
                        'B_match': 'value_counts'})
   .unstack(-1, fill_value=0)
)
Output:
A_match B_match
False True False True
bar 3.0 2.0 3 2
foo 5.0 NaN 3 2
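Note that the output above still shows floats and a NaN under A_match. Presumably fill_value in unstack only fills cells that the unstack itself creates, not a NaN that was already there after the aggregation. If integer counts with an explicit 0 are wanted, one possible alternative is to melt the two _match columns and pass the result to pd.crosstab. This is a sketch on the dummy data from the question, not the method from the answer above, and the names long and summary are just illustrative.

```python
import pandas as pd

# Rebuild the dummy data from the question so the example is self-contained.
df = pd.DataFrame({
    "grp": ["foo"] * 5 + ["bar"] * 5,
    "Ax": [3, 2, 4, 4, 4, 3, 3, 1, 1, 1],
    "Bx": [2, 1, 3, 3, 4, 0, 4, 2, 3, 1],
    "Ay": [2, 1, 0, 1, 3, 3, 0, 1, 4, 0],
    "By": [2, 0, 3, 4, 0, 0, 3, 2, 1, 3],
})
df["A_match"] = df["Ax"].eq(df["Ay"])
df["B_match"] = df["Bx"].eq(df["By"])

# Long format: one row per (grp, variable, value), where variable is the
# column name (A_match / B_match) and value is the boolean result.
long = df.melt(id_vars="grp", value_vars=["A_match", "B_match"])

# crosstab counts occurrences of each (variable, value) pair per grp and
# reports combinations that never occur as 0 rather than NaN.
summary = pd.crosstab(long["grp"], [long["variable"], long["value"]])

print(summary)
```

The result keeps the variable and value level names on the columns; summary.rename_axis(columns=[None, None]) should drop them if they are unwanted.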