Finding count of distinct elements in DataFrame in each column
I am trying to find the count of distinct values in each column using Pandas. This is what I did:
import pandas as pd
import numpy as np
# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])
I need to count the number of distinct elements for each column, like this:
col0 9538
col1 9505
col2 9524
What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?
Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook):
%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop

%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop

%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop

%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
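Outside IPython, the same comparison can be sketched with the stdlib timeit module (the exact numbers will vary with machine and pandas version; the labels and loop counts below just mimic the %timeit report):

```python
import timeit

import numpy as np
import pandas as pd

NROW, NCOL = 10000, 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + str(i) for i in range(NCOL)])

candidates = {
    'apply len(unique)': lambda: df.apply(lambda x: len(x.unique())),
    'df.nunique()':      lambda: df.nunique(),
    'apply nunique':     lambda: df.apply(pd.Series.nunique),
}

for name, fn in candidates.items():
    # best of 3 repeats, 10 calls each, like the %timeit output above
    best = min(timeit.repeat(fn, repeat=3, number=10)) / 10
    print(f'{name}: {best * 1000:.1f} ms per loop')
```

All three candidates compute the same per-column counts; only the speed differs.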
As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:
df.nunique()
a 4
b 5
c 1
dtype: int64
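For completeness, nunique also accepts an axis argument (and a dropna flag), so the same call can count distinct values per row instead of per column; a small sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

print(df.nunique())        # per column: a 4, b 5, c 1
print(df.nunique(axis=1))  # per row: distinct values across a, b, c
```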
Other legacy options:
You could take a transpose of the df and then use apply to call nunique row-wise:
In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
Out[205]:
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)
Out[206]:
a 4
b 5
c 1
dtype: int64
EDIT
As pointed out by @ajcr, the transpose is unnecessary:
In [208]:
df.apply(pd.Series.nunique)
Out[208]:
a 4
b 5
c 1
dtype: int64
A pandas.Series has a .value_counts() function that provides exactly what you want. Check out the documentation for the function.
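To be precise, value_counts() returns the frequency of each distinct value, so the distinct count is just its length; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

print(s.value_counts())       # frequency of each distinct value: 3 -> 3, 2 -> 2, 1 -> 1
print(len(s.value_counts()))  # 3 distinct values
```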
Already some great answers here :) but this one seems to be missing:
df.apply(lambda x: x.nunique())
As of pandas 0.20.0, DataFrame.nunique() is also available.
Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found another approach that ran faster than the apply function:
# Select how you want to store the output; it could be a pd.DataFrame
# or a dict. A dict is used here to demonstrate:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Import pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)
For me, this runs almost twice as fast as df.apply(lambda x: len(x.unique())).
I found:

df.agg(['nunique']).T

much faster than:

df.apply(lambda x: len(x.unique()))
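On the small frame from earlier, agg(['nunique']) produces a one-row DataFrame, and the transpose puts the column names on the index, so the result looks like the Series output of df.nunique():

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# agg returns a 1 x ncol DataFrame; .T makes it ncol x 1
counts = df.agg(['nunique']).T
print(counts)
```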
If you need to segregate only the columns with more than 20 unique values from all the columns in pandas:
col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    if data[col].dtype == 'O':
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)
    else:
        continue

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 unique values is',
      len(col_with_morethan_20_unique_values_cat))
# The output will be:
['CONTRACT NO', 'X2', 'X3', ...]
total number of columns with more than 20 unique values is 25
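The same filter can be written more compactly with nunique in a list comprehension. A sketch on a small hypothetical frame (the original data frame is not shown, so the frame below and a threshold of 2 instead of 20 are illustrative assumptions):

```python
import pandas as pd

# hypothetical frame: two object-dtype columns, one numeric column
data = pd.DataFrame({'id':  ['a', 'b', 'c', 'd'],
                     'grp': ['x', 'x', 'y', 'y'],
                     'val': [1, 2, 3, 4]})

THRESHOLD = 2  # the original answer uses 20
high_card = [col for col in data.columns
             if data[col].dtype == 'O' and data[col].nunique() > THRESHOLD]
print(high_card)  # only 'id' has more than 2 distinct object values
```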
Adding the example code for the answer given by @CaMaDuPe85:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
# df
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
for cs in df.columns:
    print(cs, df[cs].value_counts().count())
# use value_counts() in each column and count the distinct values
# Output
a 4
b 5
c 1