简体   繁体   中英

Counting unique index values in Pandas groupby

In Pandas, there is a very clean way to count the distinct values in a column within a group by operation. For example

ex = pd.DataFrame([[1, 2, 3], [6, 7, 8], [1, 7, 9]], 
                  columns=["A", "B", "C"]).set_index(["A", "B"])
ex.groupby(level="A").C.nunique()

will return

A
1    2
6    1
Name: C, dtype: int64

I would also like to count the distinct values in index level B while grouping by A . I can't find a clean way to access the levels of B from the groupby object. The best I've been able to come up with is:

ex.reset_index("B", drop=False).groupby(level="A").B.nunique()

which correctly returns:

A
1    2
6    1
Name: B, dtype: int64 

Is there a way for me to do this on the groupby without resetting the index or using an apply function?

IIUC you could do reset_index for all levels, then groupby be 'A' and apply nunique method:

res = ex.reset_index().groupby('A').agg(lambda x: x.nunique())

In [339]: res
Out[339]:
   B  C
A
1  2  2
6  1  1

Same solution with pivot_table :

In [341]: ex.reset_index().pivot_table(index='A', aggfunc=lambda x: x.nunique())
Out[341]:
   B  C
A
1  2  2
6  1  1

Not sure if this is any better, but it doesn't use an apply or reset index :)

In [20]: ex.groupby(level="A").agg(lambda x: x.index.get_level_values(1).nunique())
Out[20]:
   C
A
1  2
6  1

FWIW, I find it useful to break these apart when developing a complicated groupby. You can view the individual objects you'll work with by

In [24]: ex.groupby(level="A").get_group(1)
Out[24]:
     C
A B
1 2  3
  7  9

Together:

In [33]: (ex.groupby(level='A')
   ....:    .C.agg({'a': lambda x: x.index.get_level_values(1).nunique(),
   ....:            'b': 'nunique'}))
Out[33]:
   b  a
A
1  2  2
6  1  1

For your amusement, a not-so-easy-to-read-out-loud solution that does not use reset_index , or apply , or agg , or anonymous functions. However, it does use zip and Counter from the standard library.

import pandas as pd
from collections import Counter

ex = pd.DataFrame([[1, 2, 3], [6, 7, 8], [1, 7, 9]], 
                  columns=["A", "B", "C"]).set_index(["A", "B"])

A_val, nunique_B = zip(*[(k, len(Counter(v.index.labels[v.index.names.index('B')]))) 
                      for k, v in ex.groupby(level='A')])

pd.Series(nunique_B, index=pd.Int64Index(A_val, name='A'))

returns

A
1    2
6    1
dtype: int32

Also, for generality I do not assume that B is at level 1 of the index.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM