How to groupby a column and count the number of unique values in another column

I have the following dataframe. I need to group by the ngram column and, for each group, count how many unique documents are present in the DocID column.

[Image: screenshot of the dataframe]

For example, from the above:

4-gram group - 4 as number of unique documents (doc64,doc383,doc76,doc370)
5-gram - 4 
6-gram - 4
7-gram - 2
8-gram - 2

I have part of an idea. I can get the unique DocIDs as follows:

# Get all the docs of repeated summaries in one list as a list of lists.
rep = []
rep += temp['DocID'].str.split(",").tolist()

# Put all values in one list.
repSet = []
for i in range(len(rep)):
    repSet.extend(rep[i])

# Remove all duplicates and store in a list.
repSet = list(set(repSet))

But I don't know how to combine this with groupby.

EDIT

I have added the output from the first answer provided. Thank you. But the total number of documents is only 461, so I believe the unique DocID count can go up to only that much :( but for the trigram it is above 461 :(

[Image: output from the first answer]

Help will be greatly appreciated. Thanks!

Maybe something like this?

df.assign(docid=df['docid'].str.split(',')).explode('docid').groupby('ngram')['docid'].nunique().reset_index()
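
For reference, here is a minimal, self-contained sketch of the same split / explode / nunique idea. The sample data, the function name `unique_docs_per_ngram`, and the output column `unique_docs` are made up for illustration; the column names `ngram` and `DocID` are taken from the question and may need adjusting to match the real dataframe. The sketch also strips whitespace around each ID. That is only a guess at one possible reason for counts above the document total (for example `"doc64"` and `" doc64"` being counted as different documents), not a confirmed diagnosis.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question ("ngram", "DocID").
df = pd.DataFrame({
    "ngram": ["4-gram", "4-gram", "5-gram", "5-gram"],
    "DocID": ["doc64,doc383", "doc76, doc64,doc370", "doc1", "doc1,doc2"],
})

def unique_docs_per_ngram(frame, group_col="ngram", id_col="DocID"):
    """Count distinct comma-separated IDs in id_col for each value of group_col."""
    ids = (
        frame.set_index(group_col)[id_col]
        .str.split(",")   # "a,b" -> ["a", "b"]
        .explode()        # one row per ID, indexed by the group value
        .str.strip()      # assumption: IDs may carry stray spaces
    )
    # Drop missing and empty IDs so they are not counted as documents.
    ids = ids[ids.notna() & (ids != "")]
    return ids.groupby(level=0).nunique().rename("unique_docs").reset_index()

result = unique_docs_per_ngram(df)
print(result)
```

If the counts still exceed the number of documents after stripping, it may be worth checking `ids.unique()` for other inconsistencies such as differing letter case or separators.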
