Unique combinations of values in selected columns of a pandas DataFrame, with counts

Question

I have data in a pandas DataFrame as follows:

import pandas as pd
df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                   'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

So my data looks like this:

----------------------------
index         A        B
0           yes      yes
1           yes       no
2           yes       no
3           yes       no
4            no      yes
5            no      yes
6           yes       no
7           yes      yes
8           yes      yes
9            no       no
-----------------------------

I would like to transform it into another DataFrame. The expected output can be expressed with the following Python snippet:

output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})

So my expected output looks like this:

--------------------------------------------
index      A       B       count
--------------------------------------------
0         no       no        1
1         no      yes        2
2        yes       no        4
3        yes      yes        3
--------------------------------------------

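I can already find all the combinations and count them with the following command: mytable = df1.groupby(['A','B']).size()

However, the combinations all end up in a single column. I would like each value of a combination to go into its own column, with one more column added for the count. Is that possible? Any suggestions would be appreciated. Thanks in advance.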

Answer 1

You can groupby on columns 'A' and 'B' and call size, then reset_index and rename the generated column:

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

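Update

To explain a little: grouping on the two columns groups together the rows where the A and B values are the same, and calling size then returns the number of rows in each unique group: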

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

To restore the grouping columns, we then call reset_index:

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

This restores the index, but the size aggregation ends up in a generated column named 0, so we have to rename it:

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
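
As a slightly shorter variant of the same chain, Series.reset_index accepts a name argument that labels the values column, so the separate rename step can be dropped. A minimal sketch on the question's data (the variable name counts is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

# size() returns a Series indexed by (A, B); name= labels its values column
counts = df1.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
```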

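groupby does accept an as_index parameter that can be set to False so that the grouping columns are not made the index, but in the pandas version shown here this still produces a Series, and you would still have to restore the index and so on: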

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
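
The output above reflects the pandas version in use when this was written. As far as I know, newer releases (1.1.0 onward) return a DataFrame with a column named size here instead of a Series, so treat that as an assumption to check against your own version. A hedged sketch that copes with either return type:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

res = df1.groupby(['A', 'B'], as_index=False).size()

if isinstance(res, pd.Series):
    # Older behaviour: a Series indexed by (A, B)
    counts = res.reset_index(name='count')
else:
    # Newer behaviour (assumed): a DataFrame with a 'size' column
    counts = res.rename(columns={'size': 'count'})

print(counts)
```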

Answer 2

As of pandas 1.1.0, you can use the value_counts method on DataFrames:

df.value_counts() # or df[['A', 'B']].value_counts()

Result:

A    B
yes  no     4
     yes    3
no   yes    2
     no     1
dtype: int64

Convert the index to columns and sort by count:

df.value_counts(ascending=True).reset_index(name='count')

Result:

     A    B  count
0   no   no      1
1   no  yes      2
2  yes  yes      3
3  yes   no      4
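
If the rows should come back ordered by A and B, as in the question's expected output, rather than by count, one option is to sort after resetting the index. A minimal sketch, assuming pandas 1.1.0 or later and the question's df1:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

counts = (
    df1[['A', 'B']]
    .value_counts()                 # counts per (A, B), sorted by count
    .reset_index(name='count')      # move A and B back into columns
    .sort_values(['A', 'B'])        # order by the combination instead
    .reset_index(drop=True)         # renumber rows 0..n-1
)
print(counts)
```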

Answer 3

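Slightly related: I was looking for the unique combinations and came up with this method:

def unique_columns(df, columns):
    # One flag per row: True if the row's combination of `columns` occurs exactly once
    result = pd.Series(index=df.index, dtype=object)

    # Group the DataFrame that was passed in by the selected columns
    groups = df.groupby(by=columns)
    for name, group in groups:
        is_unique = len(group) == 1
        result.loc[group.index] = is_unique

    # Sanity check: no row should be left without a flag
    assert not result.isnull().any()

    return result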

If you only want to assert that all combinations are unique:

df1.set_index(['A','B']).index.is_unique
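
For the per-row flag, a loop-free sketch is also possible with DataFrame.duplicated: with keep=False every row whose combination occurs more than once is marked, so negating that mask should carry the same information as the function above. The names is_unique_row and all_unique are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

# True for rows whose (A, B) combination appears exactly once in the frame
is_unique_row = ~df1.duplicated(subset=['A', 'B'], keep=False)

# True only if no (A, B) combination is repeated anywhere
all_unique = not df1.duplicated(subset=['A', 'B']).any()

print(is_unique_row)
print(all_unique)
```

On the question's data only the last row (no, no) is flagged, since every other combination occurs more than once.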

Answer 4

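I haven't timed this, but it was fun to try. Basically, convert the two columns into a single column of tuples. Then run value_counts() on it to find the unique elements and count them, and turn the result into a DataFrame. Fiddle with zip again to split the tuples back out, and put the columns in the order you want. The steps could probably be made more elegant, but working with tuples seems more natural to me for this problem.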

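import pandas as pd

b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

# Pack each row's (A, B) pair into a tuple and count the distinct tuples
pairs = pd.Series(list(zip(b.A, b.B)))
counts = pairs.value_counts()

# Unzip the tuples back into separate columns, next to their counts
a_vals, b_vals = zip(*counts.index)
df = pd.DataFrame({'A': list(a_vals), 'B': list(b_vals), 'count': counts.to_numpy()})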
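
The same tuple idea can also be sketched with collections.Counter from the standard library, which avoids the intermediate Series. This is just one possible variant, not a timed or recommended approach; pair_counts and rows are illustrative names.

```python
from collections import Counter

import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

# Count how often each (A, B) tuple occurs
pair_counts = Counter(zip(df1['A'], df1['B']))

# Flatten to (A, B, count) rows, ordered by the (A, B) key
rows = [(a, b, n) for (a, b), n in sorted(pair_counts.items())]
counts = pd.DataFrame(rows, columns=['A', 'B', 'count'])
print(counts)
```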

Answer 5

Based on the accepted answer and @Bryan P's comment on the difference between count() and size(), I went with count() for cleaner code, as follows:

df1.groupby(['A','B']).count().reset_index()
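
One caveat worth illustrating: count() tallies the non-null values of each remaining column, while size() counts rows per group. The sketch below uses a made-up extra column C with a missing value (C is not part of the question's data) to show the difference, and also shows that on a frame holding only A and B, count() has no column left to report.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

# With only the grouping columns present, count() yields no count column at all
only_keys = df1.groupby(['A', 'B']).count().reset_index()
print(list(only_keys.columns))

# Hypothetical frame with an extra column C that contains a missing value
df = pd.DataFrame({'A': ['yes', 'yes', 'no'],
                   'B': ['no', 'no', 'yes'],
                   'C': [1.0, np.nan, 3.0]})

sizes = df.groupby(['A', 'B']).size()           # rows per group
non_null = df.groupby(['A', 'B'])['C'].count()  # non-null C values per group
print(sizes)
print(non_null)
```

So count() only matches size() when there is at least one other column and it has no missing values in the group.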

Answer 6

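This wraps @EdChum's very nice answer in a function, count_unique_index. The unique method only works on pandas Series, not on DataFrames. The function below reproduces the behavior of the unique function in R:

unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

and, as the OP asked, adds the number of occurrences.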

def count_unique_index(df, by):
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

count_unique_index(df1, ['A','B'])
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
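
If only the distinct rows are needed, without the counts, DataFrame.drop_duplicates is probably the closest built-in analogue to R's unique for data frames. A short sketch on the same data (unique_rows is an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

# Keep the first occurrence of each (A, B) combination, in order of appearance
unique_rows = df1[['A', 'B']].drop_duplicates().reset_index(drop=True)
print(unique_rows)
```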

6 solutions

Solution 1
226 Accepted 2016-02-08 11:46:41

Solution 2
4 2021-06-08 07:56:25

Solution 3
2 2018-09-27 16:18:17

Solution 4
0 2020-07-27 19:39:14

Solution 5
0 2022-01-14 13:13:31

Solution 6
-1 2019-06-20 09:45:16