Pandas dataframe find distinct value count for each group in other columns
I have a pandas DataFrame. A sample input looks like this:
vendor    filename  language  score        text
Vendor 1  File 1    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text2
Vendor 1  File 2    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text3
Vendor 1  File 2    Arabic    0.999999523  text3
Vendor 1  File 1    Arabic    0.756420255  text2
Vendor 2  File 3    Arabic    0.999999523  text4
Vendor 1  File 1    Arabic    0.756420255  text4
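For each language, and for each file within that language, I want to count the number of distinct values in the text column among rows whose score is greater than 0.5. So my ideal output for the sample input above would be: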
Chinese  File 1  3
         File 2  1
Arabic   File 1  2
         File 2  1
         File 3  1
Note that File 1 and File 2 are both used by Chinese and Arabic, but I want to count the unique text values separately for each language.
I tried using the pandas groupby and unique functions in the code below, but it does not work. It raises the error 'DataFrameGroupBy' object has no attribute 'unique':
df_1 = df[df["score"] > 0.5].groupby(['language', 'filename']).unique().size()
print("Number of unique text greater than 0.5 score:{}".format(df_1))
What is the best way to solve this and get the expected result?
Use DataFrameGroupBy.nunique and select the text column to count the number of unique values:
df_1 = df[df["score"] > 0.5].groupby(['language', 'filename'], sort=False)['text'].nunique()
print("Number of unique text greater than 0.5 score:\n{}".format(df_1))
Number of unique text greater than 0.5 score:
language  filename
chinese   File 1      3
          File 2      1
Arabic    File 2      1
          File 1      2
          File 3      1
Name: text, dtype: int64
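For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample input from the question by hand and applies the same filter-then-group approach. The column name n_unique_text is a hypothetical label chosen only for illustration. The last step also suggests why the original attempt failed: as far as I know, unique is available on a single grouped column (a SeriesGroupBy) but not on the grouped DataFrame itself.

```python
import pandas as pd

# Sample data copied by hand from the question.
df = pd.DataFrame({
    "vendor": ["Vendor 1", "Vendor 2", "Vendor 1", "Vendor 2",
               "Vendor 1", "Vendor 1", "Vendor 2", "Vendor 1"],
    "filename": ["File 1", "File 1", "File 2", "File 1",
                 "File 2", "File 1", "File 3", "File 1"],
    "language": ["chinese", "chinese", "chinese", "chinese",
                 "Arabic", "Arabic", "Arabic", "Arabic"],
    "score": [0.67717278, 0.644506991, 0.67717278, 0.644506991,
              0.999999523, 0.756420255, 0.999999523, 0.756420255],
    "text": ["text1", "text2", "text1", "text3",
             "text3", "text2", "text4", "text4"],
})

# Keep only rows above the score threshold, then group by both keys
# and select the text column so that later calls operate on it alone.
grouped = df[df["score"] > 0.5].groupby(["language", "filename"], sort=False)["text"]

# Number of distinct text values per (language, filename) pair.
counts = grouped.nunique()

# Optional: flatten the MultiIndex result into ordinary columns.
# "n_unique_text" is a hypothetical column name, not part of the original answer.
counts_table = counts.reset_index(name="n_unique_text")

# unique() on the grouped column returns the distinct values themselves.
distinct_values = grouped.unique()

print(counts_table)
print(distinct_values)
```

If the flat table is more convenient than the MultiIndex Series, counts_table can be used directly for further filtering or export. Dropping sort=False would order the groups by key instead of by first appearance.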