Pandas 'count(distinct)' equivalent

Question

I am using Pandas as a database substitute as I have multiple databases ( Oracle , SQL Server , etc.), and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, etc., etc.

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in Pandas?

Answer 1

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

Answer 2

Here is another method and it is much simpler. Let's say your dataframe name is daat and the column name is YEARMONTH :

daat.YEARMONTH.value_counts()

Answer 3

有趣的是，通常len(unique())比nunique()快几倍（3x-15x nunique() 。

Answer 4

I am also using nunique but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean' etc.

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count

Answer 5

Distinct of column along with aggregations on other columns

To get the distinct number of values for any column ( CLIENTCODE in your case), we can use nunique . We can pass the input as a dictionary in agg function, along with aggregations on other columns:

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]

# if you wish to reset the index
grp_df.reset_index(inplace=True)

Answer 6

Using crosstab , this will return more information than groupby nunique :

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]:
CLIENTCODE  1  2  3
YEARMONTH
201301      2  1  0
201302      1  2  1

After a little bit of modification, it yields the result:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]:
YEARMONTH
201301    2
201302    3
dtype: int64

Answer 7

Here is an approach to have count distinct over multiple columns. Let's have some data:

data = {'CLIENT_CODE':[1,1,2,1,2,2,3],
        'YEAR_MONTH':[201301,201301,201301,201302,201302,201302,201302],
        'PRODUCT_CODE': [100,150,220,400,50,80,100]
       }
table = pd.DataFrame(data)
table

CLIENT_CODE YEAR_MONTH  PRODUCT_CODE
0   1       201301      100
1   1       201301      150
2   2       201301      220
3   1       201302      400
4   2       201302      50
5   2       201302      80
6   3       201302      100

Now, list the columns of interest and use groupby in a slightly modified syntax:

columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()

We obtain:

YEAR_MONTH  PRODUCT_CODE CLIENT_CODE
1           2            3
2           2            3
3           1            1

Answer 8

使用新的 Pandas 版本，可以轻松获取数据框：

unique_count = pd.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE=('CLIENTCODE', pd.Series.count))

Answer 9

Now you are also able to use dplyr syntax in Python to do it:

>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
...     CLIENT_CODE=[1,1,2,1,2,2,3],
...     YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
   YEAR_MONTH       n
      <int64> <int64>
0      201301       2
1      201302       3

Answer 10

Create a pivot table and use the nunique series function:

ID = [ 123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
          'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id':ID, 'domain':domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)

Output:

               id
domain
facebook.com   1
google.com     1
twitter.com    2
vk.com         3

Answer 11

To obtain the number of different clients and sizes per year (ie number of unique values of multiple columns ), use a list of columns:
```
 df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
```
Actually, the result from the above code can be obtained using SQL syntax on df using pandasql (a module built on pandas that lets you query pandas DataFrames using SQL syntax).
```
 #, pip install pandasql from pandasql import sqldf sqldf(""" SELECT COUNT(DISTINCT CLIENTCODE), COUNT(DISTINCT SIZE) FROM df GROUP BY YEARMONTH """)
```

If you want to keep YEARMONTH as a column, ie the analog of the following SQL query

SELECT YEARMONTH, COUNT(DISTINCT CLIENTCODE), COUNT(DISTINCT SIZE) FROM df GROUP BY YEARMONTH

in pandas is the following (set as_index to False ):

 df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()

If you need to set custom names to the aggregated columns, ie the analog of the following SQL query:

 SELECT YEARMONTH, COUNT(DISTINCT CLIENTCODE) AS `No. clients`, COUNT(DISTINCT SIZE) AS `No. size` FROM df GROUP BY YEARMONTH

use named aggregation in pandas:

 ( df.groupby('YEARMONTH', as_index=False).agg(**{'No. clients':('CLIENTCODE', 'nunique'), 'No. size':('SIZE', 'nunique')}) )

Pandas 'count(distinct)' equivalent

Question

11 answers

solution1
520 ACCPTED 2013-03-14 14:09:06

solution2
119 2017-07-02 11:16:54

solution3
52 2014-05-05 02:59:28

solution4
14 2019-08-01 09:38:19

solution5
6 2020-04-20 10:47:51

Distinct of column along with aggregations on other columns

solution6
5 2018-11-23 15:16:20

solution7
1 2020-02-03 00:40:45

solution8
0 2019-10-02 14:58:40

solution9
0 2021-06-17 02:16:17

solution10
0 2021-06-28 14:15:31

solution11
0 2022-08-30 07:57:26

Pandas 'count(distinct)' equivalent

Question

11 answers

solution1 520 ACCPTED 2013-03-14 14:09:06

solution2 119 2017-07-02 11:16:54

solution3 52 2014-05-05 02:59:28

solution4 14 2019-08-01 09:38:19

solution5 6 2020-04-20 10:47:51

Distinct of column along with aggregations on other columns

solution6 5 2018-11-23 15:16:20

solution7 1 2020-02-03 00:40:45

solution8 0 2019-10-02 14:58:40

solution9 0 2021-06-17 02:16:17

solution10 0 2021-06-28 14:15:31

solution11 0 2022-08-30 07:57:26

solution1
520 ACCPTED 2013-03-14 14:09:06

solution2
119 2017-07-02 11:16:54

solution3
52 2014-05-05 02:59:28

solution4
14 2019-08-01 09:38:19

solution5
6 2020-04-20 10:47:51

solution6
5 2018-11-23 15:16:20

solution7
1 2020-02-03 00:40:45

solution8
0 2019-10-02 14:58:40

solution9
0 2021-06-17 02:16:17

solution10
0 2021-06-28 14:15:31

solution11
0 2022-08-30 07:57:26