Pandas dataframe find distinct value count for each group in other columns
I have a Pandas dataframe. A sample input is shown below:
vendor    filename  language  score        text
Vendor 1  File 1    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text2
Vendor 1  File 2    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text3
Vendor 1  File 2    Arabic    0.999999523  text3
Vendor 1  File 1    Arabic    0.756420255  text2
Vendor 2  File 3    Arabic    0.999999523  text4
Vendor 1  File 1    Arabic    0.756420255  text4
What I want to do is, for each language and for each file within that language, count the number of distinct values in the text column whose score is greater than 0.5. So my ideal output for the sample input above would be:
Chinese  File 1  3
         File 2  1
Arabic   File 1  2
         File 2  1
         File 3  1
Note that both File 1 and File 2 appear under both Chinese and Arabic, but I want to count the unique text values for each language separately.
I tried using pandas groupby with the unique function in the code below, but this does not work, because it raises the error 'DataFrameGroupBy' object has no attribute 'unique':
df_1 = df[df["score"] > 0.5].groupby(['language', 'filename']).unique().size()
print("Number of unique text greater than 0.5 score:{}".format(df_1))
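The error happens because unique is only defined on a single selected column (a SeriesGroupBy), not on the grouped frame as a whole. A minimal sketch of the difference, assuming current pandas behavior (the toy DataFrame here is mine, not from the question):

```python
import pandas as pd

# Hypothetical toy data, just to reproduce the error
df = pd.DataFrame({
    "language": ["chinese", "chinese", "Arabic"],
    "filename": ["File 1", "File 2", "File 1"],
    "text": ["text1", "text1", "text2"],
})

g = df.groupby(["language", "filename"])

try:
    g.unique()  # DataFrameGroupBy has no .unique
except AttributeError as e:
    print(e)

# Selecting the column first gives a SeriesGroupBy, which does
# support .unique (arrays of values) and .nunique (their counts)
print(g["text"].nunique())
```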
What is the best way to solve this and achieve the expected result?
Use DataFrameGroupBy.nunique on the text column to count the number of unique values:
df_1 = df[df["score"] > 0.5].groupby(['language', 'filename'], sort=False)['text'].nunique()
print("Number of unique text greater than 0.5 score:\n{}".format(df_1))
Number of unique text greater than 0.5 score:
language  filename
chinese   File 1      3
          File 2      1
Arabic    File 2      1
          File 1      2
          File 3      1
Name: text, dtype: int64
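For reference, here is a self-contained reproduction of the whole example (the DataFrame construction and the unique_texts column name are mine; the data and the nunique call are from the question and answer above):

```python
import pandas as pd

df = pd.DataFrame({
    "vendor":   ["Vendor 1", "Vendor 2", "Vendor 1", "Vendor 2",
                 "Vendor 1", "Vendor 1", "Vendor 2", "Vendor 1"],
    "filename": ["File 1", "File 1", "File 2", "File 1",
                 "File 2", "File 1", "File 3", "File 1"],
    "language": ["chinese", "chinese", "chinese", "chinese",
                 "Arabic", "Arabic", "Arabic", "Arabic"],
    "score":    [0.67717278, 0.644506991, 0.67717278, 0.644506991,
                 0.999999523, 0.756420255, 0.999999523, 0.756420255],
    "text":     ["text1", "text2", "text1", "text3",
                 "text3", "text2", "text4", "text4"],
})

# Filter on score first, then count distinct text values
# per (language, filename) group
counts = df[df["score"] > 0.5].groupby(
    ["language", "filename"], sort=False)["text"].nunique()
print(counts)

# If a flat DataFrame is preferred over a MultiIndex Series,
# reset_index converts it, naming the count column
flat = counts.reset_index(name="unique_texts")
```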
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.