How to group by one column and count the number of unique values in another column

Question

I have the following dataframe. I need to group by the ngram column and, for each group, count how many unique documents are present in the DocID column.

For example, from the above:

4-gram - 4 unique documents (doc64, doc383, doc76, doc370)
5-gram - 4 
6-gram - 4
7-gram - 2
8-gram - 2

I have a partial idea. I can get the unique DocIDs as follows:

# Get all the docs of repeated summaries as a list of lists.
rep = []
rep += temp['DocID'].str.split(",").tolist()

# Put all values in one list.
repSet = []
for i in range(len(rep)):
    repSet.extend(rep[i])

# Remove all duplicates and store in a list.
repSet = list(set(repSet))

But I don't know how to combine this with groupby.

EDIT

I have added the output from the first answer provided. Thank you. But the total number of documents is only 461, so I believe the unique DocID count for any group can be at most 461. For the trigram, however, it is above 461 :(

Help will be greatly appreciated. Thanks!

Answer 1

Maybe something like this?

df.assign(docid=df['docid'].str.split(',')).explode('docid').groupby('ngram')['docid'].nunique().reset_index()
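Spelled out step by step, the one-liner might look like the sketch below. The sample rows, the `count_unique_docs` helper and the `unique_docs` output column are made up for illustration, since the original dataframe is not shown, and the column is assumed to be named `DocID` as in the question's code. The whitespace stripping is also an assumption: if IDs are separated by ", " rather than ",", then "doc383" and " doc383" would be counted as different documents, which is one possible (unconfirmed) reason for a count above the total number of documents.

```python
import pandas as pd

# Hypothetical sample data; the real dataframe is not shown in the post.
df = pd.DataFrame({
    "ngram": ["4-gram", "4-gram", "5-gram", "5-gram"],
    "DocID": ["doc64,doc383", "doc76, doc383 ,doc370", "doc64", "doc64,doc76"],
})


def count_unique_docs(frame, group_col="ngram", id_col="DocID"):
    """Count distinct comma-separated IDs in id_col for each group_col value."""
    # Turn each comma-separated string into a list of IDs.
    ids = frame[id_col].str.split(",")
    # One row per (group, ID) pair; reset the index so it is unique again.
    exploded = frame.assign(**{id_col: ids}).explode(id_col).reset_index(drop=True)
    # Assumption: stray spaces around IDs should not create new documents.
    exploded[id_col] = exploded[id_col].str.strip()
    # Drop empty IDs left behind by trailing commas, if any.
    exploded = exploded[exploded[id_col] != ""]
    return (
        exploded.groupby(group_col)[id_col]
        .nunique()
        .reset_index(name="unique_docs")
    )


result = count_unique_docs(df)
print(result)
```

With this sample data the 4-gram group contains doc64, doc383, doc76 and doc370, so its count is 4; without the `str.strip()` step the same data would give 5.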
