Group by lists in a DataFrame

Question

I have a problem with a DataFrame that looks like this:

It contains "ClusterLabels" (0-44), and I want to group the "Document" column by the ClusterLabel value. I want the lists in "Document" to be combined into one list per cluster. (Duplicate words should be kept.)

I tried ".groupby", but it gives the error "sequence item 0: expected str instance, list found".

Can someone help?

Answer 1

Don't use sum to concatenate lists. It looks fancy, but it takes quadratic time and should be considered bad practice.
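As a rough illustration of why (a sketch on made-up data; the name nested is hypothetical and not from the question): sum on lists is just repeated +, and each + builds a new list and copies everything collected so far.

```python
from itertools import chain

# Hypothetical sample data: three small word lists.
nested = [["a", "b"], ["c"], ["a", "d"]]

# sum(nested, []) is evaluated as (([] + nested[0]) + nested[1]) + nested[2].
# Every "+" allocates a new list and copies all elements accumulated so far.
via_sum = sum(nested, [])

# chain.from_iterable walks each inner list once, so each element is copied once.
via_chain = list(chain.from_iterable(nested))

print(via_sum)    # ['a', 'b', 'c', 'a', 'd']
print(via_chain)  # ['a', 'b', 'c', 'a', 'd']
```

Both give the same result here; the difference only shows up as the number of lists grows, because the early elements get copied again on every step of sum.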

A better option is a list comprehension that flattens the lists:

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: [z for y in x for z in y])
         .reset_index())

Or flatten with itertools.chain:

from itertools import chain

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: list(chain(*x)))
         .reset_index())
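For anyone who wants to try this out, here is a minimal self-contained sketch of both variants. The sample frame is made up (the question's real data is not shown), and the names by_comprehension and by_chain are just for illustration. It uses chain.from_iterable instead of chain(*x), which should behave the same but avoids unpacking the whole group into arguments.

```python
from itertools import chain

import pandas as pd

# Made-up sample data (an assumption; the real frame has labels 0-44).
df = pd.DataFrame({
    "Document": [["a", "b", "c", "d"], ["a", "d"], ["a", "b"], ["c", "d"], ["d"]],
    "ClusterLabel": [0, 0, 0, 1, 1],
})

# Variant 1: a nested list comprehension flattens all lists of one cluster.
by_comprehension = (df.groupby("ClusterLabel")["Document"]
                      .agg(lambda x: [z for y in x for z in y])
                      .reset_index())

# Variant 2: chain.from_iterable flattens lazily, then list() materializes it.
by_chain = (df.groupby("ClusterLabel")["Document"]
              .agg(lambda x: list(chain.from_iterable(x)))
              .reset_index())

print(by_comprehension["Document"].tolist())
# [['a', 'b', 'c', 'd', 'a', 'd', 'a', 'b'], ['c', 'd', 'd']]
```

Duplicates are kept and the original order within each cluster is preserved, which is what the question asks for.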

Answer 2

You can do it like this:

import pandas as pd

df = pd.DataFrame({"Document" : [["a","b","c","d"],["a","d"],["a","b"],["c","d"],["d"]],
                   "ClusterLabel": [0,0,0,1,1]})

df

df.groupby("ClusterLabel").sum()

Answer 1
2 2020-05-11 12:39:20

Answer 2
0 Accepted 2020-05-11 12:59:47