Removing duplicates from lists in a Python DataFrame column

Question

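I have a pandas DataFrame in which each row holds a list of words. The lists contain repeated words, and I want to remove the duplicates.

I tried calling dict.fromkeys(listname) in a for loop over every row of the DataFrame, but that splits the words into individual letters.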

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')

df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))

print(df)

The expected result is ==>

['clear', 'pending', 'order', 'pending', 'order']   ['clear', 'pending', 'order']
 ['pending', 'activation', 'clear', 'pending']   ['pending', 'activation', 'clear']

The actual result is

['clear', 'pending', 'order', 'pending', 'order']  ...   [[, ', c, l, e, a, r, ,,  , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ...  ...  [[, ', p, e, n, d, i, g, ,,  , a, c, t, v, o, ...

Answer 1

Use set to remove the duplicates within each list.

You don't need the for loop either:

  df["newlist"] = df["text_lemmatized"].map(lambda words: list(set(words)))
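One caveat with set is word order. The short sketch below, using one of the sample rows from the question, contrasts set with dict.fromkeys; the variable names are only illustrative.

```python
words = ["clear", "pending", "order", "pending", "order"]

# set() removes duplicates but makes no promise about element order.
unordered = set(words)

# dict.fromkeys() keeps the first occurrence of each word in input order
# (dicts preserve insertion order from Python 3.7 on).
ordered = list(dict.fromkeys(words))

print(sorted(unordered))
print(ordered)
```

If the order of the words matters for later processing, the dict.fromkeys form is probably the safer choice.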

Answer 2

Just use Series.map with np.unique

Your sample data:

Out[43]:
                           text_lemmatized
0  [clear, pending, order, pending, order]
1    [pending, activation, clear, pending]

df.text_lemmatized.map(np.unique)

Out[44]:
0         [clear, order, pending]
1    [activation, clear, pending]
Name: text_lemmatized, dtype: object

If you prefer the result unsorted, use pd.unique

df.text_lemmatized.map(pd.unique)

Out[51]:
0         [clear, pending, order]
1    [pending, activation, clear]
Name: text_lemmatized, dtype: object
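To see the sorting behaviour of np.unique on a single row, here is a small sketch on a plain list. The return_index variant is one possible way to recover first-seen order with NumPy alone; it is offered as an illustration, not as part of the original answer.

```python
import numpy as np

words = ["clear", "pending", "order", "pending", "order"]

# np.unique returns the distinct values in sorted order.
sorted_unique = np.unique(words).tolist()

# return_index=True also gives the index of each value's first occurrence;
# sorting those indices restores the original word order.
_, first_idx = np.unique(words, return_index=True)
in_order = [words[i] for i in sorted(first_idx)]

print(sorted_unique)
print(in_order)
```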

Answer 3

df.drop_duplicates(subset="text_lemmatized",
                   keep="first", inplace=True)

keep="first" means that the first occurrence of each duplicated row is kept.

Answer 4

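The problem is that the values are not lists but strings, so each value has to be converted to a list with ast.literal_eval; the values can then be converted to sets to remove the duplicates: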

import ast

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
                           text_lemmatized                       newlist
0  [clear, pending, order, pending, order]       [clear, pending, order]
1    [pending, activation, clear, pending]  [clear, activation, pending]

Or use dict.fromkeys:

f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)

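Another idea is to convert the text_lemmatized column to lists in one step and remove the duplicates in a second step. The advantage is that the text_lemmatized column then holds real lists for any further processing: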

df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
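The letters in the question's output are what you get when dict.fromkeys iterates over a string. The sketch below assumes a cell holds the text representation of a list (as it would after a CSV round trip); dedupe_cell is a hypothetical helper name, not something from pandas.

```python
import ast

# Assumption: one cell as read back from the CSV, i.e. a string, not a list.
cell = "['clear', 'pending', 'order', 'pending', 'order']"

# Iterating a string yields single characters, hence the letters.
chars = list(dict.fromkeys(cell))

def dedupe_cell(value):
    """Return the unique words of a cell, keeping first-seen order."""
    if isinstance(value, str):
        # Parse the text back into a real Python list first.
        value = ast.literal_eval(value)
    return list(dict.fromkeys(value))

print(chars[:5])
print(dedupe_cell(cell))
```

Checking type(df["text_lemmatized"][0]) is a quick way to tell which of the two cases applies to your data.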

Edit:

After some discussion, the solution is:

df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))

Answer 5

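Your code for removing the duplicates seems fine. I tried the following and it worked well. I guess the problem is the way you assign the lists to the DataFrame column.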

list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]

list_with_unique_words = []

for x in list_from_df:
    # dict.fromkeys keeps the first occurrence of each word, in order
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)

print(list_with_unique_words)
# Output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]

# Assign the whole column at once
df["newlist"] = list_with_unique_words

df

Answer 6

The solution is ==>

import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)

Thanks to jezrael and everyone else for helping to narrow this down to this solution.
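For readers whose CSV stores each list as text, a minimal end-to-end sketch might look like the following. The inline csv_text is a hypothetical stand-in for the file in the question, and parsing through the converters argument of pd.read_csv is one option among several.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: each cell stores the text
# representation of a list.
csv_text = '''text_lemmatized
"['clear', 'pending', 'order', 'pending', 'order']"
"['pending', 'activation', 'clear', 'pending']"
'''

# converters parses each cell while reading, so the column holds real lists.
df = pd.read_csv(io.StringIO(csv_text),
                 converters={"text_lemmatized": ast.literal_eval})

# Assign the whole column at once instead of writing df["newlist"][i] in a loop.
df["newlist"] = df["text_lemmatized"].map(lambda words: list(dict.fromkeys(words)))

print(df["newlist"].tolist())
```

Here dict.fromkeys is used rather than set so that the words keep their original order, as in the expected result shown in the question.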

6 answers

Answer 1
4 2019-07-19 07:06:01

Answer 2
2 2019-07-19 07:16:10

Answer 3
0 2019-07-19 07:06:28

Answer 4
0 2019-07-19 07:32:23

Answer 5
0 2019-07-19 07:33:43

Answer 6
0 2019-07-19 12:17:43
