
Pandas dataframe find distinct value count for each group in other columns

I have a Pandas DataFrame; a sample of the input looks like this:

vendor    filename  language  score        text
Vendor 1  File 1    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text2
Vendor 1  File 2    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text3
Vendor 1  File 2    Arabic    0.999999523  text3
Vendor 1  File 1    Arabic    0.756420255  text2
Vendor 2  File 3    Arabic    0.999999523  text4
Vendor 1  File 1    Arabic    0.756420255  text4

For each language, and for each file within that language, I want to count the number of distinct values in the text column where score is greater than 0.5. For the sample input above, the desired output is:

Chinese  File 1  3
         File 2  1

Arabic   File 1  2
         File 2  1
         File 3  1

Note that File 1 and File 2 both appear under Chinese and Arabic, but I want to count their unique text values separately for each language.

I tried to use pandas groupby with unique in the code below, but it does not work; it fails with the error 'DataFrameGroupBy' object has no attribute 'unique':

df_1 = df[df["score"] > 0.5].groupby(['language', 'filename']).unique().size()
    
print("Number of unique text greater than 0.5 score:{}".format(df_1))

What is the best way to resolve this and get the intended result?

After grouping, select the text column and call nunique to count the number of unique values in each group. Passing sort=False keeps the groups in order of first appearance instead of sorting the group keys:

df_1 = df[df["score"] > 0.5].groupby(['language', 'filename'], sort=False)['text'].nunique()

print("Number of unique text greater than 0.5 score:\n{}".format(df_1))
Number of unique text greater than 0.5 score:
language  filename
chinese   File 1      3
          File 2      1
Arabic    File 2      1
          File 1      2
          File 3      1
Name: text, dtype: int64
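
To try this end to end, here is a minimal self-contained sketch that rebuilds the sample rows from the question and applies the same filter, groupby and nunique. The variable names above_threshold, counts and counts_table, and the column name unique_texts, are illustrative choices and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "vendor": ["Vendor 1", "Vendor 2", "Vendor 1", "Vendor 2",
                   "Vendor 1", "Vendor 1", "Vendor 2", "Vendor 1"],
        "filename": ["File 1", "File 1", "File 2", "File 1",
                     "File 2", "File 1", "File 3", "File 1"],
        "language": ["chinese", "chinese", "chinese", "chinese",
                     "Arabic", "Arabic", "Arabic", "Arabic"],
        "score": [0.67717278, 0.644506991, 0.67717278, 0.644506991,
                  0.999999523, 0.756420255, 0.999999523, 0.756420255],
        "text": ["text1", "text2", "text1", "text3",
                 "text3", "text2", "text4", "text4"],
    }
)

# Keep only the rows whose score exceeds the threshold.
above_threshold = df[df["score"] > 0.5]

# Count distinct text values per (language, filename) pair.
# sort=False keeps groups in order of first appearance.
counts = above_threshold.groupby(["language", "filename"], sort=False)["text"].nunique()
print(counts)

# Optional: turn the MultiIndex Series into a flat table.
# "unique_texts" is a hypothetical column name chosen for this sketch.
counts_table = counts.reset_index(name="unique_texts")
print(counts_table)
```

If a plain table is easier to work with than a Series with a two-level index, counts_table holds the same numbers with language and filename as ordinary columns.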
