
Pandas: Aggregate each column into a comma separated list without duplicates

Problem:

I have a large CSV file which looks something like this:

A  B   C     D    ...
1  dog black NULL ...
1  dog white NULL ...
1  dog black NULL ...
2  cat red   NULL ...
...

Now I want to "group by" column A and aggregate each remaining column into a comma-separated list without duplicates. The result should look something like this:

A  B   C             D    ...
1  dog black, white  NULL ...
2  cat red           NULL ...
...

Since the names and number of columns in the CSV may change, I would prefer a solution without hard-coded column names.

Used Approach:

I tried the package pandas with the following code:

import pandas as pd
data = pd.read_csv("C://input.csv", sep=';')
data = data.where((pd.notnull(data)), None)
data_group = data.groupby(['A']).agg(lambda x: set(x))
data_group.to_csv("C://result.csv", sep=';')

The set operator does exactly what I want. However, the resulting CSV looks like this:

A  B       C                   D      ...
1  {'dog'} {'black', 'white'}  {None} ...
2  {'cat'} {'red'}             {None} ...
...

I don't want the {} and '' in my export, and column D should be empty rather than contain the word None.

Question:

Am I on the right track, or is there a much more elegant way to achieve my goal?

Join the set with a comma, dropping missing values first so that an all-null column such as D becomes an empty string:

df.groupby('A', as_index=False).agg(lambda x: ', '.join(set(x.dropna())))

#   A    B             C D
#0  1  dog  white, black  
#1  2  cat           red  
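
Note that a set has no guaranteed order, which is why the output above shows "white, black" rather than "black, white". If a stable order matters, one possible variant is sketched below. It is only an illustration: the helper name join_unique and the inline sample data are assumptions, not part of the original answer. It uses Series.unique(), which keeps values in order of first appearance, and converts values to strings so that numeric columns can be joined as well.

```python
import io

import pandas as pd

# Sample data modeled on the question (semicolon-separated, column D empty).
# In practice this would come from pd.read_csv on the real file.
csv_text = """A;B;C;D
1;dog;black;
1;dog;white;
1;dog;black;
2;cat;red;
"""
df = pd.read_csv(io.StringIO(csv_text), sep=";")


def join_unique(values: pd.Series) -> str:
    """Join the distinct non-null values of one column.

    Values are kept in order of first appearance and converted to strings,
    so an all-null column yields an empty string.
    """
    return ", ".join(values.dropna().astype(str).unique())


# No column names other than the grouping key are hard-coded.
result = df.groupby("A", as_index=False).agg(join_unique)

# Write with the same separator as the input; here we just print the CSV text.
print(result.to_csv(sep=";", index=False))
```

With this sample data, column C for group 1 should come out as "black, white" (the order in which the values first occur), and column D should be empty in the exported CSV.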
