Pandas dataframe: count the distinct values in another column for each group

Question

I have a sample Pandas dataframe input that looks like this:

vendor    filename  language  score        text
Vendor 1  File 1    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text2
Vendor 1  File 2    chinese   0.67717278   text1
Vendor 2  File 1    chinese   0.644506991  text3
Vendor 1  File 2    Arabic    0.999999523  text3
Vendor 1  File 1    Arabic    0.756420255  text2
Vendor 2  File 3    Arabic    0.999999523  text4
Vendor 1  File 1    Arabic    0.756420255  text4

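What I want is, for each language and for each file within that language, the number of distinct values in the text column among rows whose score is greater than 0.5. So for the sample input above, my ideal output would be: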

Chinese  File 1  3
         File 2  1

Arabic   File 1  2
         File 2  1
         File 3  1

Note that File 1 and File 2 are used by both Chinese and Arabic, but I want to count the unique text values separately for each language.

I tried using pandas groupby with the unique function, as in the code below, but it does not work. It raises the error 'DataFrameGroupBy' object has no attribute 'unique':

df_1 = df[df["score"] > 0.5].groupby(['language', 'filename']).unique().size()
    
print("Number of unique text greater than 0.5 score:{}".format(df_1))

What is the best way to solve this and get the expected result?

Answer 1

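Use DataFrameGroupBy.nunique and select the text column to count the number of unique values in each group (sort=False keeps the groups in order of first appearance):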

df_1 = df[df["score"] > 0.5].groupby(['language', 'filename'], sort=False)['text'].nunique()

print("Number of unique text greater than 0.5 score:\n{}".format(df_1))
Number of unique text greater than 0.5 score:
language  filename
chinese   File 1      3
          File 2      1
Arabic    File 2      1
          File 1      2
          File 3      1
Name: text, dtype: int64
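
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample dataframe from the question by hand (the variable name counts and the reset_index column name unique_texts are only illustrative, not from the original post) and applies the same filter, groupby and nunique chain as above.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "vendor": ["Vendor 1", "Vendor 2", "Vendor 1", "Vendor 2",
               "Vendor 1", "Vendor 1", "Vendor 2", "Vendor 1"],
    "filename": ["File 1", "File 1", "File 2", "File 1",
                 "File 2", "File 1", "File 3", "File 1"],
    "language": ["chinese"] * 4 + ["Arabic"] * 4,
    "score": [0.67717278, 0.644506991, 0.67717278, 0.644506991,
              0.999999523, 0.756420255, 0.999999523, 0.756420255],
    "text": ["text1", "text2", "text1", "text3",
             "text3", "text2", "text4", "text4"],
})

# Keep rows with score > 0.5, group by language and file,
# then count the distinct text values within each group.
counts = (
    df[df["score"] > 0.5]
    .groupby(["language", "filename"], sort=False)["text"]
    .nunique()
)
print(counts)

# Optional: flatten the MultiIndex into ordinary columns.
# "unique_texts" is a hypothetical column name chosen for this sketch.
flat = counts.reset_index(name="unique_texts")
print(flat)
```

The result is a Series indexed by (language, filename), so a single count can be looked up with a tuple key such as counts.loc[("Arabic", "File 1")]. If a flat table is easier to work with, the reset_index call turns the index levels back into columns.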
