How to group by one column and count the number of unique values in another column

Question

I have the following dataframe. I need to group by ngram and, for each group, count how many unique documents appear in the DocID column.

For example, from the above:

4-gram group - 4 as number of unique documents (doc64,doc383,doc76,doc370)
5-gram - 4 
6-gram - 4
7-gram - 2
8-gram - 2

I have an idea. I can get the unique DocIDs like this:

#Get all the docs of repeated summaries in one list as a list of lists.
rep = []
rep += temp['DocID'].str.split(",").tolist()

# Put all values in one list.
repSet = []
for i in range(len(rep)):
    repSet.extend(rep[i])

# Remove all duplicates and store in a list.
repSet = list(set(repSet))

But I don't know how to combine it with groupby.

Edit

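I have added the output from the first answer. Thank you. But there are only 461 documents in total, so I believe the unique DocID count should be at most 461 :( Yet for trigrams it exceeds 461 :(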

Any help would be appreciated. Thanks!

Answer 1

Maybe something like this?

df.assign(docid=df['docid'].str.split(',')).explode('docid').groupby('ngram')['docid'].nunique().reset_index()
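The same idea can be spelled out step by step. The following is only a sketch under stated assumptions: the sample rows are made up for illustration, the column names `ngram` and `DocID` follow the question, and the result column name `n_unique_docs` is hypothetical. It also strips whitespace around each ID after splitting, on the assumption that entries such as `"doc383, doc76"` may occur; IDs that differ only by stray spaces would otherwise be counted as distinct, which is one possible reason for a count higher than the real number of documents.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "ngram": ["4-gram", "4-gram", "5-gram", "5-gram"],
    "DocID": ["doc64,doc383", "doc383, doc76", "doc1", "doc1,doc2"],
})

unique_docs = (
    df.assign(DocID=df["DocID"].str.split(","))          # "a,b" -> ["a", "b"]
      .explode("DocID")                                   # one row per (ngram, DocID)
      .assign(DocID=lambda d: d["DocID"].str.strip())     # drop stray whitespace
      .groupby("ngram")["DocID"]
      .nunique()                                          # distinct IDs per ngram
      .reset_index(name="n_unique_docs")
)
print(unique_docs)
```

With this sample data, `4-gram` has three distinct documents and `5-gram` has two, even though `doc383` and `doc1` each appear in more than one row.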
