Pandas: Aggregate each column into a comma-separated list without duplicates
Problem:
I have a large CSV file which looks something like this:
A B C D ...
1 dog black NULL ...
1 dog white NULL ...
1 dog black NULL ...
2 cat red NULL ...
...
Now I want to "group by" column A and aggregate each remaining column into a comma-separated list without duplicates. The result should look something like this:
A B C D ...
1 dog black, white NULL ...
2 cat red NULL ...
...
Since the names and number of columns in the CSV may change, I prefer a solution without hard-coded names.
Used Approach:
I tried the pandas package with the following code:
import pandas as pd
data = pd.read_csv("C://input.csv", sep=';')
data = data.where((pd.notnull(data)), None)
data_group = data.groupby(['A']).agg(lambda x: set(x))
data_group.to_csv("C://result.csv", sep=';')
Using set does exactly what I want. However, the resulting CSV looks like this:
A B C D ...
1 {'dog'} {'black', 'white'} {None} ...
2 {'cat'} {'red'} {None} ...
...
I don't want the {} and '' in my export, and column D should be empty rather than containing the word None.
Question:
Am I on the right track, or is there a much more elegant way to achieve my goal?
Join the set with a comma, dropping missing values first:
df.groupby('A', as_index=False).agg(lambda x: ', '.join(set(x.dropna())))
# A B C D
#0 1 dog white, black
#1 2 cat red
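Because a set has no guaranteed order, the joined values may come out in any order (for example "white, black" above). If first-seen order matters, one possible variant is sketched below. The helper name join_unique and the inline sample frame are assumptions made for illustration, not part of the original answer, and the astype(str) call is only there so that non-string columns can be joined as well.

```python
import pandas as pd


def join_unique(values: pd.Series) -> str:
    """Join the distinct non-missing values of one group, in first-seen order."""
    # dropna() removes None/NaN so all-missing columns become an empty string.
    # dict.fromkeys() de-duplicates while preserving insertion order.
    return ", ".join(dict.fromkeys(values.dropna().astype(str)))


# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 2],
    "B": ["dog", "dog", "dog", "cat"],
    "C": ["black", "white", "black", "red"],
    "D": [None, None, None, None],
})

# No column names besides the grouping key are hard-coded.
result = df.groupby("A", as_index=False).agg(join_unique)
print(result)
```

The result can then be written out as in the question, for instance with result.to_csv(..., sep=';', index=False), so that the empty strings in column D become empty fields.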