Finding the count of distinct elements in each column of a DataFrame

Question

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

import pandas as pd
import numpy as np

# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements in each column, like this:

col0    9538
col1    9505
col2    9524

What would be the most efficient way to do this, given that the method will be applied to files larger than 1.5 GB?
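For files that do not fit comfortably in memory, one option is to read the data in chunks and keep a running set of seen values per column. The following is a minimal sketch, assuming the data is a CSV file; the helper name `nunique_chunked`, the chunk size, and the tiny in-memory sample are illustrative and not part of the original question.

```python
import io

import pandas as pd


def nunique_chunked(source, chunksize=100_000, **read_csv_kwargs):
    """Count distinct non-null values per column, reading a CSV in chunks.

    Hypothetical helper: `source` is anything pd.read_csv accepts.
    """
    seen = {}  # column name -> set of distinct values seen so far
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        for col in chunk.columns:
            seen.setdefault(col, set()).update(chunk[col].dropna().unique())
    return pd.Series({col: len(values) for col, values in seen.items()})


# Small in-memory stand-in for a large file (made-up data).
csv_text = "a,b,c\n0,1,1\n1,2,1\n1,3,1\n2,4,1\n3,5,1\n"
counts = nunique_chunked(io.StringIO(csv_text), chunksize=2)
print(counts)
```

The sets still hold every distinct value, so memory use grows with the cardinality of the columns rather than with the number of rows. Dtype inference happens per chunk, so columns with mixed types may need an explicit `dtype` passed through to `pd.read_csv`.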

Based on the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).

%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop
%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop
%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop
%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
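Timings like these depend on the pandas version, the dtypes, and the machine, so it is worth re-measuring on your own data. Below is a rough sketch of how the comparison could be reproduced outside a notebook with the standard `timeit` module; the frame is deliberately smaller than the one in the question, and the names `bench_df` and `candidates` are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the question's 10000 x 100 frame so the sketch runs quickly.
rng = np.random.default_rng(0)
bench_df = pd.DataFrame(rng.integers(1, 100000, (2000, 20)),
                        columns=[f'col{i}' for i in range(20)])

candidates = {
    'apply + len(unique)': lambda: bench_df.apply(lambda x: len(x.unique())),
    'DataFrame.nunique': lambda: bench_df.nunique(),
    'apply(pd.Series.nunique)': lambda: bench_df.apply(pd.Series.nunique),
}

# Run each candidate once so the results can be compared for agreement.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{name}: {best * 1e3:.2f} ms per call')
```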

Answer 1

As of pandas 0.20, we can call nunique directly on a DataFrame, i.e.:

df.nunique()
a    4
b    5
c    1
dtype: int64
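One detail worth knowing is how `DataFrame.nunique` treats missing values: by default (`dropna=True`) they are not counted. A small sketch with made-up data (the frame `df_nan` is not from the original post):

```python
import numpy as np
import pandas as pd

# Made-up frame containing missing values in both columns.
df_nan = pd.DataFrame({'a': [0, 1, 1, np.nan],
                       'b': ['x', 'y', None, None]})

without_nan = df_nan.nunique()             # default: missing values ignored
with_nan = df_nan.nunique(dropna=False)    # missing counts as one extra value

print(without_nan)
print(with_nan)
```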

Other legacy options:

You could transpose the df and then use apply to call nunique row-wise:

In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
   a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64

EDIT

As pointed out by @ajcr, the transpose is unnecessary:

In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64
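As a quick sanity check, the three spellings shown in this answer can be compared on the same example frame. This is only a sketch; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

direct = df.nunique()                                       # pandas >= 0.20
via_apply = df.apply(pd.Series.nunique)                     # column-wise apply
via_transpose = df.T.apply(lambda x: x.nunique(), axis=1)   # transposed, row-wise

# Both comparisons use Series.equals, which checks values and index.
print(direct.equals(via_apply))
print(direct.equals(via_transpose))
```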

Answer 2

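pandas.Series has a .value_counts() method, which returns the frequency of each distinct value; the length of its result is the number of distinct values. Check out the documentation for the function.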
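To make the relationship concrete, here is a short sketch on a single column (the Series `s` is made-up example data):

```python
import pandas as pd

s = pd.Series([0, 1, 1, 2, 3], name='a')

freq = s.value_counts()   # one entry per distinct value, with its frequency
n_distinct = len(freq)    # number of distinct (non-null) values

print(freq)
print(n_distinct, s.nunique())
```

If only the count is needed, `s.nunique()` avoids building the full frequency table.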

Answer 3

There are already some great answers here :) but this one seems to be missing:

df.apply(lambda x: x.nunique())

As of pandas 0.20.0, DataFrame.nunique() is also available.

Answer 4

Recently, I ran into the same problem of counting the unique values in each column of a DataFrame, and I found an approach that runs faster than the apply function:

# Choose how to store the output (a pd.DataFrame or a dict); a dict is used here:
col_uni_val={}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)

For me, this is almost twice as fast as df.apply(lambda x: len(x.unique())).
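The same loop can be written as a dict comprehension. One caveat, shown in the sketch below with made-up data: `len(x.unique())` counts NaN as a value, whereas `nunique()` ignores it by default, so the two approaches can disagree on columns with missing values.

```python
import numpy as np
import pandas as pd

# Made-up frame; column 'a' contains one missing value.
df_nan = pd.DataFrame({'a': [0, 1, 1, np.nan], 'c': [1, 1, 1, 1]})

# Loop from this answer, as a comprehension: NaN is counted as a value.
via_unique = {col: len(df_nan[col].unique()) for col in df_nan.columns}

# nunique() skips NaN unless dropna=False is passed.
via_nunique = df_nan.nunique().to_dict()

print(via_unique)
print(via_nunique)
```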

Answer 5

I found that

df.agg(['nunique']).T

is much faster.
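Note that this returns a DataFrame rather than a Series: after the transpose, the original columns become the index and `nunique` becomes a column. A small sketch on the example frame used in the other answers (the extra `'count'` aggregation is added here only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

result = df.agg(['nunique']).T      # DataFrame with a single 'nunique' column
as_series = result['nunique']       # same values as a Series

# Several aggregations can be requested at once.
summary = df.agg(['nunique', 'count']).T

print(result)
print(summary)
```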

Answer 6

df.apply(lambda x: len(x.unique()))

Answer 7

If you only need to isolate the columns (here, the object-dtype ones) that have more than 20 unique values:

col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    # Only consider object-dtype (typically string) columns.
    if data[col].dtype == 'O':
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 unique values is', len(col_with_morethan_20_unique_values_cat))

# The output will look like:
['CONTRACT NO', 'X2','X3',,,,,,,..]
total number of columns with more than 20 unique values is 25
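The same filter can be expressed without an explicit loop. This is a sketch using made-up data (the frame `data` and its column names are hypothetical); it relies on `select_dtypes` and `nunique`, and unlike `len(unique())` it does not count NaN as a value.

```python
import pandas as pd

# Made-up frame: two object-dtype columns and one numeric column.
data = pd.DataFrame({
    'id': pd.Series([f'id{i}' for i in range(30)], dtype=object),  # 30 distinct
    'group': pd.Series(['g1', 'g2', 'g3'] * 10, dtype=object),     # 3 distinct
    'value': range(30),                                            # numeric, skipped
})

# Distinct counts for the object-dtype columns only.
counts = data.select_dtypes(include='object').nunique()

# Names of the columns with more than 20 distinct values.
high_cardinality = counts[counts > 20].index.tolist()

print(high_cardinality)
print('total number of columns with more than 20 unique values is',
      len(high_cardinality))
```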

Answer 8

Adding example code for the answer given by @CaMaDuPe85:

df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

# df
    a   b   c
0   0   1   1
1   1   2   1
2   1   3   1
3   2   4   1
4   3   5   1


for cs in df.columns:
    # value_counts() has one entry per distinct value,
    # so count() on it gives the number of distinct values
    print(cs, df[cs].value_counts().count())

# Output

a 4
b 5
c 1

8 answers

Answer 1
74 Accepted 2015-05-28 10:09:38

Answer 2
6 2015-05-29 11:34:33

Answer 3
5 2017-04-13 11:45:08

Answer 4
1 2016-10-18 20:29:30

Answer 5
1 2020-05-01 20:19:58

Answer 6
0 2018-05-10 14:35:07

Answer 7
0 2019-08-05 11:51:41

Answer 8
0 2020-04-22 17:25:36
