
Remove duplicates from python dataframe list

I have a pandas df where each row is a list of words. The lists contain duplicate words, and I want to remove them.

I tried using dict.fromkeys(listname) in a for loop to iterate over each row in the df, but this splits the words into individual characters.

import pandas as pd

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath, encoding='windows-1252')

df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))

print(df)

Expected result:

['clear', 'pending', 'order', 'pending', 'order']   ['clear', 'pending', 'order']
 ['pending', 'activation', 'clear', 'pending']   ['pending', 'activation', 'clear']

Actual result:

['clear', 'pending', 'order', 'pending', 'order']  ...   [[, ', c, l, e, a, r, ,,  , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ...  ...  [[, ', p, e, n, d, i, g, ,,  , a, c, t, v, o, ...

Use set to remove duplicates. Note that a set does not keep the original word order.

Also, you don't need the for loop; apply set to each row instead:

  df["newlist"] = df["text_lemmatized"].map(lambda words: list(set(words)))

Just use Series.map and np.unique.

Your sample data:

Out[43]:
                           text_lemmatized
0  [clear, pending, order, pending, order]
1    [pending, activation, clear, pending]

df.text_lemmatized.map(np.unique)

Out[44]:
0         [clear, order, pending]
1    [activation, clear, pending]
Name: text_lemmatized, dtype: object

If you prefer the result not to be sorted, use pd.unique, which keeps the order of first appearance:

df.text_lemmatized.map(pd.unique)

Out[51]:
0         [clear, pending, order]
1    [pending, activation, clear]
Name: text_lemmatized, dtype: object
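
To make the difference between the sorted and the unsorted result concrete, here is a minimal sketch on the sample data. It assumes the column already holds real Python lists, and it swaps in dict.fromkeys as a plain-Python, order-preserving stand-in for pd.unique; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, as real Python lists (an assumption)
df = pd.DataFrame({
    "text_lemmatized": [
        ["clear", "pending", "order", "pending", "order"],
        ["pending", "activation", "clear", "pending"],
    ]
})

# np.unique removes duplicates and returns the words in sorted order
sorted_unique = df["text_lemmatized"].map(lambda words: np.unique(words).tolist())

# dict.fromkeys keeps each word at the position of its first occurrence
ordered_unique = df["text_lemmatized"].map(lambda words: list(dict.fromkeys(words)))

print(sorted_unique.tolist())
print(ordered_unique.tolist())
```

The second result matches the expected output in the question, because the original word order is kept.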

df.drop_duplicates(subset="text_lemmatized",
                   keep="first", inplace=True)

keep="first" means that the first occurrence of each duplicated row is kept.
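
Note that drop_duplicates compares whole rows, so it removes repeated rows rather than repeated words inside one row. A small sketch of that behaviour, using a hypothetical three-row frame whose values are strings (as they would be straight after read_csv):

```python
import pandas as pd

# Hypothetical sample: rows 0 and 2 are identical strings
df = pd.DataFrame({
    "text_lemmatized": [
        "['clear', 'pending', 'order', 'pending', 'order']",
        "['pending', 'activation', 'clear', 'pending']",
        "['clear', 'pending', 'order', 'pending', 'order']",
    ]
})

# Drops the repeated row, but leaves every remaining row unchanged
deduped_rows = df.drop_duplicates(subset="text_lemmatized", keep="first")

print(len(df), len(deduped_rows))
print(deduped_rows["text_lemmatized"].iloc[0])
```

The remaining first row still contains 'pending' and 'order' twice, so this does not address duplicates within a row.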

The problem is that the values are not lists but strings, so it is necessary to convert each value to a list with ast.literal_eval; the values can then be converted to sets to remove duplicates:

import ast

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
                           text_lemmatized                       newlist
0  [clear, pending, order, pending, order]       [clear, pending, order]
1    [pending, activation, clear, pending]  [clear, activation, pending]

Or use dict.fromkeys, which keeps the original word order:

f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)
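
The character-by-character output in the question can be reproduced without pandas. The sketch below assumes one cell value that is still a string; it shows why dict.fromkeys yields single characters, and how ast.literal_eval changes that.

```python
import ast

# One cell value as it arrives from the CSV: a str, not a list
raw = "['clear', 'pending', 'order', 'pending', 'order']"

# Iterating over a str yields single characters, so duplicates are removed per character
chars = list(dict.fromkeys(raw))

# literal_eval parses the text into a real list first, so duplicates are removed per word
words = list(dict.fromkeys(ast.literal_eval(raw)))

print(chars)
print(words)
```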

Another idea is to convert the text_lemmatized column to lists in one step and then remove duplicates in a second step. The advantage is that text_lemmatized then holds real lists for further processing:

df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
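
As a possible variation on the two-step idea, the conversion can also happen while the file is read, via the converters parameter of read_csv. This is only a sketch: the in-memory CSV below is a hypothetical stand-in for output2.csv, and dict.fromkeys is used instead of set to keep the word order.

```python
import ast
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real file
csv_text = """text_lemmatized
"['clear', 'pending', 'order', 'pending', 'order']"
"['pending', 'activation', 'clear', 'pending']"
"""

# converters applies ast.literal_eval to every cell of the column while parsing
df = pd.read_csv(io.StringIO(csv_text), converters={"text_lemmatized": ast.literal_eval})

# The column now holds real lists, so de-duplication works per word
df["newlist"] = df["text_lemmatized"].map(lambda words: list(dict.fromkeys(words)))

print(df["newlist"].tolist())
```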

EDIT:

After some discussion, the solution is:

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))

Your code for removing duplicates seems fine. I tried the following and it worked well. I guess the problem is the way you are assigning the lists to the dataframe column.

list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]

list_with_unique_words = []

for x in list_from_df:
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)

print(list_with_unique_words)
# output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]

df["newlist"] = list_with_unique_words

df
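
The same loop can be written directly against the DataFrame, building the whole column first and assigning it once instead of writing cell by cell. A minimal sketch, assuming text_lemmatized already holds real lists; wrapping the result in pd.Series is a choice made here so that each inner list is clearly stored as a single cell value.

```python
import pandas as pd

# Sample data from the question, as real Python lists (an assumption)
df = pd.DataFrame({
    "text_lemmatized": [
        ["clear", "pending", "order", "pending", "order"],
        ["pending", "activation", "clear", "pending"],
    ]
})

# Build every de-duplicated list first, then assign the column in one step
unique_lists = [list(dict.fromkeys(words)) for words in df["text_lemmatized"]]
df["newlist"] = pd.Series(unique_lists, index=df.index)

print(df["newlist"].tolist())
```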


Solution:

import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)

Thanks to jezrael and everyone else who helped narrow this down to a solution.
