Remove duplicates from python dataframe list
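I have a Pandas df in which each row is a list of words. The lists contain duplicate words, and I want to remove them.
I tried using dict.fromkeys(listname) in a for loop over every row of the df, but this splits the words into individual letters.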
import pandas as pd

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath, encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
for i in range(0, len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))
print(df)
The expected result is ==>
['clear', 'pending', 'order', 'pending', 'order'] ['clear', 'pending', 'order']
['pending', 'activation', 'clear', 'pending'] ['pending', 'activation', 'clear']
The actual result is
['clear', 'pending', 'order', 'pending', 'order'] ... [[, ', c, l, e, a, r, ,, , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ... ... [[, ', p, e, n, d, i, g, ,, , a, c, t, v, o, ...
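Use set to remove the duplicates.
You don't need a for loop either. Apply set to each row's list rather than to the whole column (this assumes every cell already holds a real list):
df["newlist"] = df["text_lemmatized"].map(lambda words: list(set(words)))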
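One caveat worth checking before relying on set: it does not promise to keep the words in their original order, while the expected output in the question does keep it. A minimal sketch of the difference, using one row of the sample data (the variable names are only illustrative):

```python
words = ['clear', 'pending', 'order', 'pending', 'order']

# set() removes duplicates, but the order of the result is arbitrary.
unordered = list(set(words))

# dict.fromkeys() removes duplicates and keeps first-seen order
# (dict insertion order is guaranteed from Python 3.7 on).
ordered = list(dict.fromkeys(words))

print(sorted(unordered))  # sorted only to make the printout stable
print(ordered)            # ['clear', 'pending', 'order']
```

If the order of the words does not matter for the next processing step, set is fine.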
Just use Series.map with np.unique.
Your sample data:
Out[43]:
text_lemmatized
0 [clear, pending, order, pending, order]
1 [pending, activation, clear, pending]
df.text_lemmatized.map(np.unique)
Out[44]:
0 [clear, order, pending]
1 [activation, clear, pending]
Name: text_lemmatized, dtype: object
np.unique returns the values sorted. If you would rather keep them unsorted, use pd.unique:
df.text_lemmatized.map(pd.unique)
Out[51]:
0 [clear, pending, order]
1 [pending, activation, clear]
Name: text_lemmatized, dtype: object
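df.drop_duplicates(subset="text_lemmatized",
                   keep="first", inplace=True)
keep="first" means the first occurrence is kept. Note that this drops duplicate rows; it does not remove repeated words inside a single list.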
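The problem is that the values are not lists but strings, so each value first has to be converted to a list with ast.literal_eval; the values can then be converted to sets to remove the duplicates: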
import ast
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
text_lemmatized newlist
0 [clear, pending, order, pending, order] [clear, pending, order]
1 [pending, activation, clear, pending] [clear, activation, pending]
Or use dict.fromkeys, which keeps the words in their original order:
f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)
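To see why the conversion matters, here is a small sketch with no pandas involved. The string below is an assumption about what a cell read back from the CSV looks like; iterating over it yields characters, which matches the letters in the question's actual output.

```python
import ast

# Assumed cell content after reading the CSV: text that only looks like a list.
cell = "['clear', 'pending', 'order', 'pending', 'order']"

# Iterating over a str yields characters, so the "unique items" are characters.
as_chars = list(dict.fromkeys(cell))

# Parse the text into a real list first, then deduplicate.
as_words = list(dict.fromkeys(ast.literal_eval(cell)))

print(as_chars[:4])  # ['[', "'", 'c', 'l']
print(as_words)      # ['clear', 'pending', 'order']
```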
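Another idea is to convert the text_lemmatized column to lists in one step and remove the duplicates in a separate step; the advantage is that the text_lemmatized column then holds real lists for further processing: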
df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
Edit:
After some discussion, the solution is:
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
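If the column comes straight from a CSV file, the parsing step can probably be folded into the read itself through the converters argument of pd.read_csv. The sketch below is only an illustration under assumptions: it uses an in-memory buffer as a stand-in for output2.csv and rebuilds the sample rows from the question, and it uses dict.fromkeys instead of set so the word order is kept.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for output2.csv: write the sample rows to a buffer.
original = pd.DataFrame({
    "text_lemmatized": [
        ['clear', 'pending', 'order', 'pending', 'order'],
        ['pending', 'activation', 'clear', 'pending'],
    ]
})
buffer = io.StringIO()
original.to_csv(buffer, index=False)  # each list is written as its text form
buffer.seek(0)

# converters parses every cell of the column back into a real list.
df = pd.read_csv(buffer, converters={"text_lemmatized": ast.literal_eval})

# Deduplicate each row's list while keeping first-seen order.
df["newlist"] = df["text_lemmatized"].map(lambda words: list(dict.fromkeys(words)))

print(df["newlist"].tolist())
```

With a real file, the buffer would be replaced by the file path and the encoding argument from the question.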
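Your code for removing the duplicates looks fine. I tried the following and it works well. My guess is that the problem lies in how the lists end up in the dataframe column.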
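import pandas as pd

list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]
# df is rebuilt here from the sample lists so the snippet runs on its own
df = pd.DataFrame({"text_lemmatized": list_from_df})

list_with_unique_words = []
for x in list_from_df:
    # dict.fromkeys drops repeats while keeping first-seen order
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)
print(list_with_unique_words)
# Output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]

df["newlist"] = list_with_unique_words
print(df)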
The solution is ==>
import pandas as pd
filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)
Thanks to jezrael and everyone else who helped narrow it down to this solution.