简体   繁体   中英

Replicate Countifs() in pandas with multiple conditions

I have a DataFrame of the form:

在此输入图像描述

What I would like to achieve is a DataFrame that has unique groups and a count of non-zeros for each column 1-9. It would look like the below.

在此输入图像描述

I'm pretty much at a loss on this. I saw the question here , but this doesn't address the grouping aspect. How can I achieve this using pandas?

Assuming you have the following DF:

In [82]: df
Out[82]:
    Group  1  2  3  4  5
0  Group1  0  1  4  0  1
1  Group1  3  0  4  1  5
2  Group2  0  1  4  3  6
3  Group2  5  1  4  0  7
4  Group3  0  0  4  7  8
5  Group3  7  1  4  7  9

Solution:

In [83]: df.set_index('Group').ne(0).groupby(level=0).sum().reset_index()
Out[83]:
    Group    1    2    3    4    5
0  Group1  1.0  1.0  2.0  1.0  2.0
1  Group2  1.0  2.0  2.0  1.0  2.0
2  Group3  1.0  1.0  2.0  2.0  2.0

as integers:

In [84]: df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
Out[84]:
    Group  1  2  3  4  5
0  Group1  1  1  2  1  2
1  Group2  1  2  2  1  2
2  Group3  1  1  2  2  2

UPDATE:

In [92]: df
Out[92]:
    Group  1  2  3  4  5
0  Group1  0  1  4  0  0
1  Group1  3  0  4  1  0
2  Group2  0  1  4  3  0
3  Group2  5  1  4  0  0
4  Group3  0  0  4  7  0
5  Group3  7  1  4  7  0

In [93]: df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
Out[93]:
    Group  1  2  3  4  5
0  Group1  1  1  2  1  0
1  Group2  1  2  2  1  0
2  Group3  1  1  2  2  0

Timing:

In [37]: %timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
100 loops, best of 3: 6.24 ms per loop

In [38]: %timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)
100 loops, best of 3: 19.2 ms per loop

let's test it against a bigger (60.000 rows) DF:

In [39]: df = pd.concat([df] * 10**4, ignore_index=True)

In [40]: df.shape
Out[40]: (60000, 6)

In [42]: %timeit df.set_index('Group').ne(0).groupby(level=0).sum().astype(int).reset_index()
10 loops, best of 3: 22.4 ms per loop

In [43]: %timeit df.pivot_table(index='Group', aggfunc=np.count_nonzero)
10 loops, best of 3: 43 ms per loop

Conclusion:

IMO high-level function pivot_table is slower due to additional overhead, but t he difference is not that huge for bigger data sets...

Using df from @MaxU solution:

df.pivot_table(index='Group', aggfunc=np.count_nonzero)


        1   2   3   4   5
Group                   
Group1  1   1   2   1   2
Group2  1   2   2   1   2
Group3  1   1   2   2   2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM