Remove duplicates from a data frame column that holds lists
I have a lot of duplicate values within each row of a data frame column. A sample is below. I looked at other Stack Overflow questions, but I could only find answers for plain lists, not for duplicates inside a data frame column.
When I pass the values as a list, I can remove the duplicates. When I pass them as a data frame column, I get this error:
TypeError: unhashable type: 'list'
What am I doing wrong here?
import pandas as pd
d = {'col1': ['apples are delicious,apples are delicious,apples', 'apples'], 'col2': ['mangoes','oranges']}
df = pd.DataFrame(data=d)
df['col1'] = set(df['col1'].str.split(","))  # raises TypeError; list(set(...)) fails the same way
df['col2'] = df['col2'].str.split(",")  # convert each value to a list
print(df)
The final output should have the duplicates removed, like this:
col1 col2
['apples are delicious','apples'] ['mangoes']
['apples'] ['oranges']
You are using set on an entire series, whereas you need to apply set to each element in the series. For this, you can use pd.Series.map:
df['col1'] = df['col1'].str.split(',').map(set)
print(df)
col1 col2
0 {apples are delicious, apples} [mangoes]
1 {apples} [oranges]
Your error derives from the fact that you can't have a set of lists, since lists are not hashable.
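To see where the error comes from, here is a minimal sketch that reproduces it outside the data frame. The series s and the variable message are illustrative names, not part of the original question. Calling set on the series iterates over its elements, which are lists after the split, and tries to hash each one.

```python
import pandas as pd

# Each element of s is a list after the split, e.g. ['a', 'a', 'b'].
s = pd.Series(['a,a,b', 'b']).str.split(',')

message = None
try:
    # set() tries to hash every element of the series; lists cannot be hashed.
    set(s)
except TypeError as exc:
    message = str(exc)

print(message)

# Applying set to each element works, because the strings inside are hashable.
per_row = s.map(set)
print(per_row.tolist())
```

The exact wording of the exception can vary slightly between Python versions, but it should mention that the list type is unhashable.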
If you really need a series of lists as the result, you can use the same method again, i.e. df['col1'].str.split(',').map(set).map(list). But note that you should not assume any ordering, as set is an unordered collection.
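If the order of first appearance matters, as it seems to in the desired output, one alternative to set is dict.fromkeys, which keeps insertion order in Python 3.7 and later. This is a different technique from the map(set) approach above, shown only as a sketch; the helper name dedupe_keep_order is made up for illustration.

```python
import pandas as pd

d = {'col1': ['apples are delicious,apples are delicious,apples', 'apples'],
     'col2': ['mangoes', 'oranges']}
df = pd.DataFrame(data=d)


def dedupe_keep_order(items):
    """Hypothetical helper: drop repeats while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Split each string into a list, then de-duplicate each list separately.
df['col1'] = df['col1'].str.split(',').map(dedupe_keep_order)
df['col2'] = df['col2'].str.split(',')

print(df)
```

Like map(set), this works per row, so each cell ends up with its own de-duplicated list.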