
Finding count of distinct elements in DataFrame in each column

I am trying to find the count of distinct values in each column using pandas. This is what I did.

import pandas as pd
import numpy as np

# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:

col0    9538
col1    9505
col2    9524

What would be the most efficient way to do this, given that the method will be applied to files larger than 1.5GB?


Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).

%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
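
Outside IPython, a comparison like this can be reproduced with the standard-library timeit module. The following is a minimal sketch, assuming a smaller frame than the one in the question so that it runs quickly; the helper name time_call and the sizes are illustrative only, and the ranking will vary with the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the frame in the question so the sketch runs quickly (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 100000, size=(1000, 20)),
                  columns=[f"col{i}" for i in range(20)])

candidates = {
    "apply + len(unique)": lambda: df.apply(lambda x: len(x.unique())),
    "DataFrame.nunique": lambda: df.nunique(),
    "apply(pd.Series.nunique)": lambda: df.apply(pd.Series.nunique),
}

def time_call(func, repeat=3, number=5):
    """Hypothetical helper: best average time per call, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

timings = {name: time_call(func) for name, func in candidates.items()}
results = {name: func() for name, func in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

All three candidates return the same per-column counts here because the data contains no missing values.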

As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:

df.nunique()
a    4
b    5
c    1
dtype: int64
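
DataFrame.nunique also accepts axis and dropna parameters. A short sketch, using a made-up frame with one missing value (not data from the question) to show the effect of dropna:

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): column 'a' contains one NaN.
df = pd.DataFrame({"a": [0.0, 1.0, 1.0, np.nan], "b": [1, 2, 3, 4]})

per_column = df.nunique()                       # NaN is ignored by default
per_column_with_nan = df.nunique(dropna=False)  # NaN counts as a value
per_row = df.nunique(axis=1)                    # distinct values in each row

print(per_column)
print(per_column_with_nan)
print(per_row)
```

The default dropna=True is worth keeping in mind when comparing against len(x.unique()), which does count NaN.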

Other legacy options:

You could transpose the df and then use apply to call nunique row-wise:

In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
   a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64

EDIT

As pointed out by @ajcr, the transpose is unnecessary:

In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64

A pandas.Series has a .value_counts() method that provides what you need. Check out the documentation for the function.
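
To make that concrete: value_counts() returns the frequency of each distinct value rather than a single number, so the distinct count is the length of its result. A small sketch on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

counts_a = df['a'].value_counts()   # frequency of each distinct value in 'a'
n_distinct_a = len(counts_a)        # number of distinct values in 'a'

print(counts_a)
print(n_distinct_a)
```

Like nunique(), value_counts() skips NaN unless dropna=False is passed.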

Already some great answers here :) but this one seems to be missing:

df.apply(lambda x: x.nunique())

As of pandas 0.20.0, DataFrame.nunique() is also available.

Recently, I had the same problem of counting the unique values of each column in a DataFrame, and I found an approach that runs faster than the apply function:

# Choose how to store the output (a pd.DataFrame or a dict); a dict is used here:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)

For me this runs almost twice as fast as df.apply(lambda x: len(x.unique())).
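
The same per-column loop can also be written as a dict comprehension and, if output shaped like df.nunique() is wanted, wrapped in a pandas.Series. A sketch on a small illustrative frame; whether it beats apply depends on the data and the pandas version:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# One entry per column: number of distinct values (NaN would be counted too).
col_uni_val = {col: len(df[col].unique()) for col in df.columns}

# Optional: a Series indexed by column name.
as_series = pd.Series(col_uni_val)

print(col_uni_val)
```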

I found:我发现:

df.agg(['nunique']).T

much faster than:

df.apply(lambda x: len(x.unique()))
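
For reference, df.agg(['nunique']) returns a one-row DataFrame, and the transpose turns it into one row per original column. A minimal sketch on a small illustrative frame; it only shows the shape of the result and makes no timing claim:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

result = df.agg(['nunique']).T   # index: a, b, c; single column: 'nunique'
as_series = result['nunique']    # the same counts as a Series

print(result)
```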

To pick out only the object-dtype columns that have more than 20 unique values:

col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    # Only consider object-dtype (typically string) columns.
    if data[col].dtype == 'O':
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(col)

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 number of unique value is', len(col_with_morethan_20_unique_values_cat))



# The output will look like:
['CONTRACT NO', 'X2','X3',,,,,,,..]
total number of columns with more than 20 number of unique value is 25
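
The same filter can be written without an explicit loop. This is a sketch under assumptions: the frame data and its columns below are made up for illustration, and dropna=False is passed so that NaN is counted the way len(unique()) counts it.

```python
import pandas as pd

# Hypothetical data: one object column with 25 distinct values, one with 2,
# and a numeric column that the dtype filter should skip.
data = pd.DataFrame({
    "CONTRACT NO": pd.Series([f"C{i}" for i in range(25)], dtype=object),
    "flag": pd.Series(["y", "n"] * 12 + ["y"], dtype=object),
    "amount": range(25),
})

# Distinct count per object-dtype column; dropna=False matches len(unique()).
unique_counts = data.select_dtypes(include="object").nunique(dropna=False)
cols_over_20 = unique_counts[unique_counts > 20].index.tolist()

print(cols_over_20)
print("number of columns with more than 20 unique values:", len(cols_over_20))
```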

Adding example code for the answer given by @CaMaDuPe85:

df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

# df
    a   b   c
0   0   1   1
1   1   2   1
2   1   3   1
3   2   4   1
4   3   5   1


for cs in df.columns:
    print(cs,df[cs].value_counts().count()) 
    # using value_counts in each column and count it 

# Output

a 4
b 5
c 1
