Pandas: How to delete duplicates in rows and do multiple topic matching
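I have the following dataframe, dfstart, whose first column contains comments that each cover a variety of topics. The label column contains the keywords associated with those topics.
Using a second dataframe, matchlist, I want to create the final dataframe, dffinal, in which each comment shows the labels and topics that occur in it. I also want each label to appear only once per row.
I tried to eliminate the duplicate labels with a for loop: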
for label in matchlist['label']:
    if dfstart[label[n]] == dfstart[label[n-1]]:
        dfstart['label'] == np.nan
However, this does not seem to work. In addition, I have managed to merge dfstart with matchlist so that the first topic is shown in the dataframe. The code I used is:
df2 = pd.merge(df, matchlist, on='label1')
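Of course, I could keep renaming the column in matchlist and repeat the process for each label column, but that would take a long time and would not be efficient, because my real dataframe is much larger than this toy example. So I am wondering whether there is a more elegant way to do this.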
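For reference, the loop shown earlier compares with == where an assignment would be needed, and n is never defined, so it cannot change dfstart. A minimal sketch of per-row de-duplication follows; the helper name dedupe_labels is hypothetical and not part of the original post, and it assumes labels are separated by commas with optional surrounding spaces.

```python
import pandas as pd

def dedupe_labels(cell):
    """Split a comma-separated label string and drop repeats, keeping first-seen order."""
    labels = [part.strip() for part in cell.split(",")]
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order
    return list(dict.fromkeys(labels))

dfstart = pd.DataFrame({
    "comment": ["comment1", "comment2", "comment3"],
    "label": ["boxing, election, rain", "boxing, boxing", "election, rain, election"],
})

# One list of unique labels per comment
unique_labels = dfstart["label"].apply(dedupe_labels)
print(unique_labels.tolist())
# [['boxing', 'election', 'rain'], ['boxing'], ['election', 'rain']]
```

Working on the raw label string, before it is split into label1, label2, ..., avoids comparing neighbouring columns one pair at a time.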
Here are the three toy dataframes:
import numpy as np
import pandas as pd

# Starting dataframe: one comment per row, labels as a comma-separated string
d = {'comment': ["comment1", "comment2", "comment3"], 'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
dfstart = pd.DataFrame(data=d)
# Split on ", " so the resulting labels carry no leading space
dfstart[['label1', 'label2', 'label3']] = dfstart.label.str.split(", ", expand=True)
# Lookup table: label -> topic
d3 = {'label': ["boxing", "election", "rain"], 'topic': ["sport", "politics", "weather"]}
matchlist = pd.DataFrame(data=d3)
# Desired result: unique labels per row, each with its topic
d2 = {'comment': ["comment1", "comment2", "comment3"], 'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"], 'label1': ["boxing", "boxing", "election"], 'label2': ["election", np.nan, "rain"], 'label3': ["rain", np.nan, np.nan], 'topic1': ["sport", "sport", "politics"], 'topic2': ["politics", np.nan, "weather"], 'topic3': ["weather", np.nan, np.nan]}
dffinal = pd.DataFrame(data=d2)
Thanks for your help!
Use str.extractall instead of str.split so that you get all the matches in one go, then flatten the result, map it to your matchlist, and finally concat everything together:
import pandas as pd

d = {'comment': ["comment1", "comment2", "comment3"],
     'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
df = pd.DataFrame(d)
matchlist = pd.DataFrame({'label': ["boxing", "election", "rain"], 'topic': ["sport", "politics", "weather"]})

# Lookup Series: label -> topic
s = matchlist.set_index("label")["topic"]

# One named group per matchlist label, so each label lands in the column of its matchlist position;
# first() collapses repeated matches to a single value per original row
found = (df["label"].str.extractall("|".join(f"(?P<label{num}>{i})" for num, i in enumerate(s.index, 1)))
           .groupby(level=0).first())

# Map every label column to its topic and rename label1..label3 to topic1..topic3
topics = (found.apply(lambda col: col.map(s))
               .rename(columns={f"label{i}": f"topic{i}" for i in range(1, 4)}))

print(pd.concat([df, found, topics], axis=1))

    comment                     label  label1    label2 label3 topic1    topic2   topic3
0  comment1    boxing, election, rain  boxing  election   rain  sport  politics  weather
1  comment2            boxing, boxing  boxing       NaN    NaN  sport       NaN      NaN
2  comment3  election, rain, election     NaN  election   rain    NaN  politics  weather
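In the output above, each label sits in the column of its position in matchlist, so comment3 has NaN in label1, whereas the dffinal in the question fills the columns from the left. A hedged alternative sketch, not the answerer's method, reshapes to long form with explode, drops repeats, and pivots back. The names topic_by_label, long_df, wide and pos are illustrative assumptions.

```python
import pandas as pd

dfstart = pd.DataFrame({
    "comment": ["comment1", "comment2", "comment3"],
    "label": ["boxing, election, rain", "boxing, boxing", "election, rain, election"],
})
matchlist = pd.DataFrame({
    "label": ["boxing", "election", "rain"],
    "topic": ["sport", "politics", "weather"],
})

# Lookup Series: label -> topic
topic_by_label = matchlist.set_index("label")["topic"]

# Long form: one row per (original row, label), with repeats inside a comment removed
labels = dfstart["label"].str.split(",").explode().str.strip()
long_df = labels.rename_axis("row").rename("label").reset_index().drop_duplicates()

# Attach the topic and a 1-based position of each unique label within its comment
long_df["topic"] = long_df["label"].map(topic_by_label)
long_df["pos"] = long_df.groupby("row").cumcount() + 1

# Back to wide form: columns become (label, 1), (label, 2), ..., (topic, 1), ...
wide = long_df.pivot(index="row", columns="pos", values=["label", "topic"])
wide.columns = [f"{name}{pos}" for name, pos in wide.columns]

dffinal = dfstart.join(wide)
print(dffinal)
```

The number of labelN and topicN columns follows the largest count of unique labels in any one comment, so nothing has to be renamed or merged column by column.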