How to groupby a column and count the number of unique values in another column
I have the following dataframe. I need to group by the ngram column and, for each group, count how many unique documents are present in the DocID column.
For example, from the above:
4-gram - 4 unique documents (doc64, doc383, doc76, doc370)
5-gram - 4
6-gram - 4
7-gram - 2
8-gram - 2
I have part of an idea. I can get the unique DocIDs as follows:
# Get all the docs of repeated summaries in one list as a list of lists.
rep = []
rep += temp['DocID'].str.split(",").tolist()
# Put all values in one list.
repSet = []
for i in range(len(rep)):
    repSet.extend(rep[i])
# Remove all duplicates and store in a list.
repSet = list(set(repSet))
But I don't know how to merge this with groupby.
EDIT
I have added the output from the first answer provided. Thank you. But the total number of documents is only 461, so I believe the unique DocID count for any group can be at most 461 :( Yet for the trigram it is above 461 :(
Help will be greatly appreciated. Thanks!
Maybe something like this?
df.assign(docid=df['docid'].str.split(',')).explode('docid').groupby('ngram')['docid'].nunique().reset_index()
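To make the one-liner easier to check, here is a minimal sketch of the same split, explode, groupby, nunique chain on made-up data. The sample frame is hypothetical (the real one is not shown), and the column names `ngram` and `DocID` follow the question; rename to `docid` if your columns are lowercase. The whitespace stripping and empty-string filter are assumptions on my part: stray spaces after the commas would make " doc64" and "doc64" count as different documents, which is one possible reason a count could exceed the true number of documents.

```python
import pandas as pd

# Hypothetical sample data; the frame from the question is not shown.
df = pd.DataFrame({
    "ngram": ["4-gram", "4-gram", "5-gram", "5-gram"],
    "DocID": ["doc64,doc383", "doc76, doc64,doc370", "doc1,doc2", "doc2,doc2"],
})

# One row per (ngram, single document ID).
exploded = (
    df.assign(DocID=df["DocID"].str.split(","))
      .explode("DocID")
      .reset_index(drop=True)
)

# Assumption: IDs may carry stray whitespace or be empty after splitting.
exploded["DocID"] = exploded["DocID"].str.strip()
exploded = exploded[exploded["DocID"] != ""]

# Number of distinct documents per ngram group.
unique_docs = (
    exploded.groupby("ngram")["DocID"]
            .nunique()
            .reset_index(name="unique_docs")
)

# Distinct documents overall; no group's count can exceed this.
total_docs = exploded["DocID"].nunique()

print(unique_docs)
print(total_docs)
```

If a per-group count still comes out larger than `total_docs` on your data, the IDs are probably not being normalized the same way in both places, so it is worth printing a few of the exploded values with `repr` to look for hidden characters.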