Perform COUNTIF with GROUP BY in Pandas - Python 3.x
I have a dataframe, `df`, which looks like this:
| | rating | foo1 | foo2 | foo3 | foo4 | foo5 |
|:--:|:------:|:-----:|:----:|:-----:|:----:|:-----:|
| 1 | 2 | 0 | 0 | 0.98 | 0 | 0.7 |
| 2 | 2 | 0 | 0 | 0 | 0.3 | 0.007 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0.1 | 0.99 | 0 | 0 | 0.005 |
| 5 | 4 | 0 | 0 | 0 | 0 | 0.01 |
| 6 | 2 | 0 | 0 | 0.66 | 0 | 0.27 |
| 7 | 4 | 0 | 0.92 | 0.32 | 0 | 0.11 |
| 8 | 2 | 0.003 | 0 | 0.073 | 0 | 0.218 |
| 9 | 4 | 0 | 0 | 0 | 0 | 0.004 |
| 10 | 4 | 0 | 0 | 0 | 0 | 0.001 |
except that I have about 13,000 features and only care about a certain subset (say `foo1`, `foo2`, `foo3`, `foo4`, and `foo5`).
The shape of my `df` is: 2000 rows x 13984 columns
What I need to do is count the number of nonzero values in each column, grouped by `rating`, to produce a result like:
| | foo1 | foo2 | foo3 | foo4 | foo5 |
|:-:|:----:|:----:|:----:|:----:|:----:|
| 2 | 1 | 0 | 3 | 1 | 4 |
| 4 | 1 | 2 | 1 | 0 | 5 |
I know that in SQL I could do something like:
SELECT
rating,
SUM(CASE WHEN foo1 != 0 THEN 1 ELSE 0 END) as foo1,
SUM(CASE WHEN foo2 != 0 THEN 1 ELSE 0 END) as foo2,
SUM(CASE WHEN foo3 != 0 THEN 1 ELSE 0 END) as foo3,
SUM(CASE WHEN foo4 != 0 THEN 1 ELSE 0 END) as foo4,
SUM(CASE WHEN foo5 != 0 THEN 1 ELSE 0 END) as foo5
FROM
df
GROUP BY
rating
I have found this Stack Overflow post, but it shows how to create a similar calculation for all columns, and I only care about a specific five (`foo1`, `foo2`, `foo3`, `foo4`, `foo5`).
How can I write a solution to achieve the desired result using Python pandas?
If I understand you correctly, first `set_index` to `rating`, then `groupby`:
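import numpy as np
import pandas as pd

# Build reproducible sample data: 100 rows, ratings of 2 or 4, 0/1 features
np.random.seed(500)
e = {"rating": np.random.choice([2, 4], 100),
     "foo1": np.random.randint(0, 2, 100),
     "foo2": np.random.randint(0, 2, 100),
     "foo3": np.random.randint(0, 2, 100),
     "foo4": np.random.randint(0, 2, 100)}
df = pd.DataFrame(e)

# Move rating into the index so only the feature columns remain
df = df.set_index("rating")

# For each rating, count the nonzero entries in every column
print(df.groupby(df.index).apply(lambda x: x.ne(0).sum()))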
# Output:
        foo1  foo2  foo3  foo4
rating
2         21    21    24    19
4         32    26    24    30
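Since the question only cares about a named subset of roughly 13,984 columns, it may help to select those columns before grouping. The following is a minimal sketch of an alternative, not the answer's own method: it reuses the same random sample data, and the list `cols` and the variable `counts` are names introduced here for illustration. It converts the selected columns to booleans with `ne(0)` and sums them per rating, which avoids `apply` altogether.

```python
import numpy as np
import pandas as pd

# Same reproducible sample data as in the answer above
np.random.seed(500)
df = pd.DataFrame({
    "rating": np.random.choice([2, 4], 100),
    "foo1": np.random.randint(0, 2, 100),
    "foo2": np.random.randint(0, 2, 100),
    "foo3": np.random.randint(0, 2, 100),
    "foo4": np.random.randint(0, 2, 100),
})

# Hypothetical list of the columns of interest, selected by name
cols = ["foo1", "foo2", "foo3", "foo4"]

# True where a value is nonzero; summing booleans per rating counts them
counts = df[cols].ne(0).groupby(df["rating"]).sum()
print(counts)
```

On the real dataframe, `cols` would hold whichever column names matter, and they do not need to be adjacent. I have not benchmarked this against the `apply` version, so treat any speed difference as something to measure rather than assume.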
You can do it this way:
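# Select the five feature columns by position (rating is column 0 here)
cols = df.columns[1:6]
# For each rating, count the nonzero entries in the selected columns
df.groupby('rating')[cols].apply(lambda x: x.ne(0).sum()).reset_index()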
# Output:
   rating  foo1  foo2  foo3  foo4  foo5
0       2     1     0     3     1     4
1       4     1     2     1     0     5
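To check this against the question's own numbers, here is a small self-contained sketch that rebuilds the ten sample rows from the question and runs the same `groupby(...)[cols].apply(...)` call. One assumption on my part: `cols` is written as an explicit list of names rather than the positional slice `df.columns[1:6]`, because in a frame with thousands of columns the ones of interest are unlikely to sit next to each other. The variable name `result` is also just for illustration.

```python
import pandas as pd

# The ten sample rows from the question, indexed 1..10
df = pd.DataFrame(
    {
        "rating": [2, 2, 2, 4, 4, 2, 4, 2, 4, 4],
        "foo1": [0, 0, 0, 0.1, 0, 0, 0, 0.003, 0, 0],
        "foo2": [0, 0, 0, 0.99, 0, 0, 0.92, 0, 0, 0],
        "foo3": [0.98, 0, 0, 0, 0, 0.66, 0.32, 0.073, 0, 0],
        "foo4": [0, 0.3, 0, 0, 0, 0, 0, 0, 0, 0],
        "foo5": [0.7, 0.007, 0, 0.005, 0.01, 0.27, 0.11, 0.218, 0.004, 0.001],
    },
    index=range(1, 11),
)

# Explicit column names instead of a positional slice (an assumption, see above)
cols = ["foo1", "foo2", "foo3", "foo4", "foo5"]

# Count nonzero values per column within each rating group
result = df.groupby("rating")[cols].apply(lambda x: x.ne(0).sum()).reset_index()
print(result)
# rating 2 -> 1, 0, 3, 1, 4 and rating 4 -> 1, 2, 1, 0, 5, matching the desired table
```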