

Number of unique values in each Dask Dataframe column

I have a Dask Dataframe called train which is loaded from a large CSV file, and I would like to count the number of unique values in each column. I can clearly do it for each column separately:

    for col in categorical_cols:
        num = train[col].nunique().compute()
        line = f'{col}\t{num}'
        print(line)

However, the above code will go through the huge CSV file once per column, instead of going through it only once. That takes plenty of time, and I want it to be faster. If I were writing it 'by hand', I would certainly do it with a single scan of the file.

Can Dask compute the number of unique values in each column efficiently? Something like the DataFrame.nunique() function in Pandas.
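
One way to get a single scan (a sketch, not from the original question) is to build the lazy per-column nunique() results first and pass them to one dask.compute() call, so Dask can merge the task graphs and share the CSV read across all columns; the file name and column list below are placeholders:

    import dask
    import dask.dataframe as dd

    # Placeholder file and column names, standing in for the real ones.
    train = dd.read_csv('train.csv')
    categorical_cols = ['col_a', 'col_b']

    # Build one lazy result per column, then compute them together so the
    # CSV partitions are read once and shared by all of the counts.
    counts = dask.compute(*[train[col].nunique() for col in categorical_cols])

    for col, num in zip(categorical_cols, counts):
        print(f'{col}\t{num}')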

Have you tried the drop_duplicates() method? Something like this:

    import dask.dataframe as dd

    # df is an existing pandas DataFrame; n is the desired number of partitions
    ddf = dd.from_pandas(df, npartitions=n)

    ddf.drop_duplicates().compute()
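
As a follow-up sketch (not part of the original answer), drop_duplicates() can also be applied per column and combined with count() in a single dask.compute() call to get unique counts rather than unique rows; the column names are placeholders:

    import dask

    # Count distinct values per column; computing the results together lets
    # Dask process the partitions once for all columns.
    cols = ['col_a', 'col_b']
    counts = dask.compute(*[ddf[col].drop_duplicates().count() for col in cols])
    print(dict(zip(cols, counts)))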

You can get the number of unique values in each non-numeric column using .describe():

    df.describe(include=['object', 'category']).compute()

If you have categorical columns with dtype int/float, you would have to convert those columns to categories before applying .describe() to get the unique-count statistics. And obviously, getting the unique count of numeric data is not supported this way.
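
A minimal sketch of that conversion (assuming a hypothetical integer-coded column named 'label'; categorize() is one way to obtain known categories in Dask):

    # Convert the integer-coded column to a known categorical dtype so that
    # describe() reports it in its 'unique' row; 'label' is a placeholder name.
    df = df.categorize(columns=['label'])
    summary = df.describe(include=['object', 'category']).compute()
    print(summary.loc['unique'])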
