
Number of unique values in each Dask DataFrame column

I have a Dask DataFrame called train, loaded from a large CSV file, and I would like to count the number of unique values in each column. I can easily do it for each column separately:

    for col in categorical_cols:
        # each .compute() triggers a separate pass over the CSV
        num = train[col].nunique().compute()
        line = f'{col}\t{num}'
        print(line)

However, the above code scans the huge CSV file once per column instead of going through the file only once. It takes plenty of time, and I want it to be faster. If I were writing it by hand, I would certainly do it with a single scan of the file.

Can Dask compute the number of unique values in each column efficiently? Something like the DataFrame.nunique() method in pandas.
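For reference, a minimal single-pass sketch, assuming train and categorical_cols are defined as above: build all the lazy nunique tasks first and hand them to a single dask.compute() call, so Dask can share the underlying CSV scan across all the counts:

    import dask

    # build the lazy nunique tasks first, without computing anything yet
    lazy_counts = {col: train[col].nunique() for col in categorical_cols}

    # one dask.compute() call evaluates all tasks over a shared file scan
    (counts,) = dask.compute(lazy_counts)

    for col, num in counts.items():
        print(f'{col}\t{num}')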

Have you tried the drop_duplicates() method? Something like this:

    import dask.dataframe as dd

    ddf = dd.from_pandas(df, npartitions=n)
    ddf.drop_duplicates().compute()
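Note that drop_duplicates() on the whole frame removes duplicate rows rather than counting unique values per column. To count the unique values of a single column (here 'col' is a hypothetical column name), you could apply it to that column instead:

    # the size of the deduplicated column is its number of unique values
    ddf['col'].drop_duplicates().size.compute()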

You can get the number of unique values in each non-numeric column using .describe():

    df.describe(include=['object', 'category']).compute()

If you have categorical columns stored with an int/float dtype, you would have to convert them to the category dtype before applying .describe() to get unique-count statistics. And obviously, .describe() does not report a unique count for numeric data.
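A minimal sketch of that conversion, assuming df is a Dask DataFrame with an integer-coded categorical column named 'label' (a hypothetical name):

    # cast the integer codes to the category dtype; Dask initially treats the
    # categories as unknown, so make them known before aggregating
    df['label'] = df['label'].astype('category').cat.as_known()

    # describe() now reports count/unique/top/freq for the column
    df.describe(include=['category']).compute()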
