
Number of unique values in each Dask DataFrame column

I have a Dask DataFrame called train, loaded from a large CSV file, and I would like to count the number of unique values in each column. I can easily do it for each column separately:

    for col in categorical_cols:
        # each .compute() triggers a separate pass over the CSV
        num = train[col].nunique().compute()
        line = f'{col}\t{num}'
        print(line)

However, the above code scans the huge CSV file once per column instead of going through the file only once. It takes plenty of time, and I want it to be faster. If I were writing it by hand, I would certainly do it with a single scan of the file.

Can Dask compute the number of unique values in each column efficiently? Something like the DataFrame.nunique() method in pandas.
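For reference, a minimal single-pass sketch, assuming train and categorical_cols are defined as above: build all the lazy nunique tasks first and hand them to a single dask.compute() call, so Dask can share the underlying CSV scan across all the counts:

    import dask

    # build the lazy nunique tasks first, without computing anything yet
    lazy_counts = {col: train[col].nunique() for col in categorical_cols}

    # one dask.compute() call evaluates all tasks over a shared file scan
    (counts,) = dask.compute(lazy_counts)

    for col, num in counts.items():
        print(f'{col}\t{num}')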

Have you tried the drop_duplicates() method? Something like this:

    import dask.dataframe as dd

    ddf = dd.from_pandas(df, npartitions=n)
    ddf.drop_duplicates().compute()
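Note that drop_duplicates() on the whole frame removes duplicate rows rather than counting unique values per column. To count the unique values of a single column (here 'col' is a hypothetical column name), you could apply it to that column instead:

    # the size of the deduplicated column is its number of unique values
    ddf['col'].drop_duplicates().size.compute()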

You can get the number of unique values in each non-numeric column using .describe():

    df.describe(include=['object', 'category']).compute()

If you have categorical columns stored with an int/float dtype, you would have to convert them to the category dtype before applying .describe() to get unique-count statistics. And obviously, .describe() does not report a unique count for numeric data.
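A minimal sketch of that conversion, assuming df is a Dask DataFrame with an integer-coded categorical column named 'label' (a hypothetical name):

    # cast the integer codes to the category dtype; Dask initially treats the
    # categories as unknown, so make them known before aggregating
    df['label'] = df['label'].astype('category').cat.as_known()

    # describe() now reports count/unique/top/freq for the column
    df.describe(include=['category']).compute()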
