Pandas DataFrame: count the distinct values in another column for each group

Question

I have a pandas DataFrame; a sample of the input looks like this:

vendor    filename  language  score        text
Vendor 1  File 1    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text2
Vendor 1  File 2    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text3
Vendor 1  File 2    Arabic    0.999999523  text3
Vendor 1  File 1    Arabic    0.756420255  text2
Vendor 2  File 3    Arabic    0.999999523  text4
Vendor 1  File 1    Arabic    0.756420255  text4

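For each language, and for each file within that language, I want to count the number of distinct values in the text column among rows whose score is greater than 0.5. So my ideal output for the sample input above would be: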

Chinese  File 1  3
         File 2  1

Arabic   File 1  2
         File 2  1
         File 3  1

Note that File 1 and File 2 appear under both Chinese and Arabic, but I want to count the unique text values separately for each language.

I tried using pandas groupby together with the unique function in the code below, but it does not work; it raises the error 'DataFrameGroupBy' object has no attribute 'unique':

df_1 = df[df["score"] > 0.5].groupby(['language', 'filename']).unique().size()
    
print("Number of unique text greater than 0.5 score:{}".format(df_1))

What is the best way to solve this and get the expected result?

Answer 1

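After grouping, select the text column and call nunique on it (at that point the call is SeriesGroupBy.nunique) to count the number of unique values in each group. Passing sort=False keeps the groups in order of first appearance: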

df_1 = df[df["score"] > 0.5].groupby(['language', 'filename'], sort=False)['text'].nunique()

print("Number of unique text greater than 0.5 score:\n{}".format(df_1))
Number of unique text greater than 0.5 score:
language  filename
chinese   File 1      3
          File 2      1
Arabic    File 2      1
          File 1      2
          File 3      1
Name: text, dtype: int64
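
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample table in the question, and the variable names high_score, grouped, counts and distinct_texts are only illustrative. It also shows why the original attempt failed: unique is not available on a DataFrameGroupBy, but it is available once a single column has been selected.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "vendor": ["Vendor 1", "Vendor 2", "Vendor 1", "Vendor 2",
               "Vendor 1", "Vendor 1", "Vendor 2", "Vendor 1"],
    "filename": ["File 1", "File 1", "File 2", "File 1",
                 "File 2", "File 1", "File 3", "File 1"],
    "language": ["chinese", "chinese", "chinese", "chinese",
                 "Arabic", "Arabic", "Arabic", "Arabic"],
    "score": [0.67717278, 0.644506991, 0.67717278, 0.644506991,
              0.999999523, 0.756420255, 0.999999523, 0.756420255],
    "text": ["text1", "text2", "text1", "text3",
             "text3", "text2", "text4", "text4"],
})

# Keep only rows above the score threshold, then group by language and file.
high_score = df[df["score"] > 0.5]
grouped = high_score.groupby(["language", "filename"], sort=False)

# Selecting one column turns the DataFrameGroupBy into a SeriesGroupBy,
# which supports both nunique() and unique().
counts = grouped["text"].nunique()         # number of distinct texts per group
distinct_texts = grouped["text"].unique()  # the distinct texts themselves

print(counts)
print(distinct_texts)
```

If the counts are needed as an ordinary table rather than a Series with a MultiIndex, counts.reset_index(name="n_unique_text") should give one row per language/file pair; the column name n_unique_text is just a placeholder.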

Solution 1
1 2021-02-08 05:44:05
