
Group by with lists in Dataframe

I have a problem with a DataFrame that looks like this:

[enter image description here]

It contains a "ClusterLabel" column (values 0-44), and I want to group the "Document" column by the ClusterLabel value. The lists in "Document" should be combined into one list per cluster. (Duplicate words should be kept.)

I tried ".groupby", but it gives the error "sequence item 0: expected str instance, list found".
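
For reference, that message is the TypeError that Python's str.join raises when its items are lists rather than strings. A minimal sketch that reproduces it without pandas, assuming the failing aggregation joined the "Document" values with something like " ".join (the exact call is not shown in the question, and the `documents` list here is made up for illustration):

```python
# Hypothetical contents of the "Document" column for one cluster:
# each item is a list of words, not a string.
documents = [["a", "b"], ["c", "d"]]

try:
    " ".join(documents)  # str.join requires every item to be a str
except TypeError as exc:
    message = str(exc)

print(message)
```

If that is the cause, the lists have to be flattened rather than joined as strings.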

Can someone help?

Don't use sum to concatenate lists. It looks fancy, but it runs in quadratic time because every addition copies the accumulated list, so it should be considered bad practice.

A better option is a list comprehension that flattens the lists:

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: [z for y in x for z in y])
         .reset_index())

Or flatten with itertools.chain:

from itertools import chain

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: list(chain(*x)))
         .reset_index())
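
To try both variants end to end, here is a minimal self-contained sketch. The sample frame is an assumption (small made-up data in the shape the question describes), and it uses chain.from_iterable, which is equivalent to chain(*x) but avoids unpacking the whole group into arguments:

```python
from itertools import chain

import pandas as pd

# Assumed sample data: each "Document" cell holds a list of words.
df = pd.DataFrame({
    "Document": [["a", "b", "c", "d"], ["a", "d"], ["a", "b"], ["c", "d"], ["d"]],
    "ClusterLabel": [0, 0, 0, 1, 1],
})

grouped = df.groupby("ClusterLabel")["Document"]

# Variant 1: flatten each group's lists with a nested list comprehension.
by_comprehension = grouped.agg(lambda x: [z for y in x for z in y]).reset_index()

# Variant 2: flatten each group's lists with itertools.chain.
by_chain = grouped.agg(lambda x: list(chain.from_iterable(x))).reset_index()

print(by_chain)
```

Both variants should give one row per ClusterLabel, with duplicates kept and the original word order preserved.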

You can do it like this:

import pandas as pd

df = pd.DataFrame({"Document" : [["a","b","c","d"],["a","d"],["a","b"],["c","d"],["d"]],
                   "ClusterLabel": [0,0,0,1,1]})

df

[enter image description here]

df.groupby("ClusterLabel").sum()

[enter image description here]


 