Group by with lists in Dataframe

Question

I have a problem with a Dataframe looking like this:

It contains "ClusterLabel" values (0-44), and I want to group the "Document" column by the ClusterLabel value. The lists in "Document" should be combined into one list per cluster (duplicate words should be kept).

I tried ".groupby", but it gives the error "sequence item 0: expected str instance, list found".

Can someone help?

Answer 1

Don't use sum to concatenate lists. It looks fancy, but it's quadratic (every addition copies the list accumulated so far) and should be considered bad practice.
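To make the point concrete, here is a minimal pure-Python sketch with made-up sample lists (the variable names are only illustrative). Both calls give the same flat list, but sum builds a fresh intermediate list on every step, while chain.from_iterable walks each element once.

```python
from itertools import chain

# Hypothetical sample data: three small word lists
lists = [["a", "b"], ["a", "d"], ["c"]]

# sum() evaluates [] + lists[0] + lists[1] + ..., copying the
# growing result each time, hence the quadratic cost.
via_sum = sum(lists, [])

# chain.from_iterable() iterates over every element exactly once.
via_chain = list(chain.from_iterable(lists))

print(via_sum == via_chain)
```

On a handful of short lists the difference is invisible; it only starts to matter when a group contains many documents.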

A better option is a list comprehension that flattens the lists:

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: [z for y in x for z in y])
         .reset_index())

Or flatten with itertools.chain:

from itertools import chain

# chain.from_iterable avoids unpacking every list as a separate argument
df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: list(chain.from_iterable(x)))
         .reset_index())
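As a quick check, the sketch below runs both variants on a small made-up DataFrame. The sample data and the names by_comprehension and by_chain are assumptions for illustration, not the asker's real data; it uses chain.from_iterable, which should behave the same as chain(*x) here.

```python
from itertools import chain

import pandas as pd

# Hypothetical sample data with two clusters
df = pd.DataFrame({
    "Document": [["a", "b"], ["a", "d"], ["c"], ["c", "d"]],
    "ClusterLabel": [0, 0, 1, 1],
})

grouped = df.groupby("ClusterLabel")["Document"]

# Variant 1: nested list comprehension
by_comprehension = (grouped
                    .agg(lambda docs: [word for doc in docs for word in doc])
                    .reset_index())

# Variant 2: itertools.chain
by_chain = (grouped
            .agg(lambda docs: list(chain.from_iterable(docs)))
            .reset_index())

print(by_comprehension)
print(by_chain["Document"].tolist() == by_comprehension["Document"].tolist())
```

Each cluster ends up with one combined list, and duplicates such as the repeated "a" in cluster 0 are kept.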

Answer 2

You can sum the lists within each group, like this:

import pandas as pd

df = pd.DataFrame({"Document" : [["a","b","c","d"],["a","d"],["a","b"],["c","d"],["d"]],
                   "ClusterLabel": [0,0,0,1,1]})

df

df.groupby("ClusterLabel").sum()

solution1
2 2020-05-11 12:39:20

solution2
0 ACCEPTED 2020-05-11 12:59:47
