How to match a string list with keywords (matched or unmatched) and add them in the new column with RegEx
How to tag keywords and add to new column in Python
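I am trying to extract tags from sentences with the code below, but it returns the keywords instead. What am I missing? How can I output all the tags (not the keywords), comma-separated, in a new column?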
s = set(dict_list)
f = lambda x: ', '.join(set([y for y in x.split() if y in s]))
# df['tags'] = df['description_summary'].apply(f)
df['tags'] = df['description_summary'].apply(lambda x: ', '.join(set(x.split()).intersection(s)))
df
This is basically the data I am using, from an Excel file:
description_summary
0 Long sentence with keywords ball and hot
1 Long sentence with keywords stick, glove, and cold
This is the current (incorrect) output:
   description_summary                                 keywords instead of tags
0  Long sentence with keywords ball and hot            ball, hot
1  Long sentence with keywords cold, stick, and glove  cold, stick, glove
This is the output I want:
   description_summary                                 tags
0  Long sentence with keywords ball and hot            toy, temperature
1  Long sentence with keywords cold, stick, and glove  temperature, toy
This is the dictionary of keywords and tags ('keywords': 'tags'):
dict_list = {'Hot': 'Temperature',
'Cold': 'Temperature',
'Very cold': 'Temperature',
'Ball': 'Toy',
'Glove': 'Toy',
'Stick': 'Toy'
}
How can I output only the tags (comma-separated) in a new column of the same file?
You can use ordinary dictionary indexing to return the associated value rather than the key itself.
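As a minimal sketch of that difference, consider one sentence from the question and a hypothetical two-entry mapping (`keyword_to_tag` is an illustrative name, not from the question):

```python
# Hypothetical two-entry mapping, lowercased for illustration only
keyword_to_tag = {'hot': 'temperature', 'ball': 'toy'}

words = 'Long sentence with keywords ball and hot'.lower().split()

# Keeping the matched word returns the keyword itself
keywords = [w for w in words if w in keyword_to_tag]

# Indexing the dictionary with the matched word returns its tag
tags = [keyword_to_tag[w] for w in words if w in keyword_to_tag]

print(keywords)  # ['ball', 'hot']
print(tags)      # ['toy', 'temperature']
```

The code in the question does the first of these, which is why it outputs keywords.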
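Note that I have edited the dictionary from your question so that it is easier to verify that this works. You will also need to account for case sensitivity.
import pandas as pd

df = pd.DataFrame({'description_summary':['Long sentence with keywords ball and hot',
'Long sentence with keywords cold, stick, and glove']})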
dict_list = {'Hot': 'Temperature (hot)',
'Cold': 'Temperature (cold)',
'Very cold': 'Temperature (very cold)',
'Ball': 'Toy (ball)',
'Glove': 'Toy (glove)',
'Stick': 'Toy (stick)'}
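# Lowercase keys and values so that matching is case-insensitive
d_lower = {key.lower():value.lower() for key, value in dict_list.items()}
# Substring-match each keyword against the lowercased sentence;
# sorted() makes the order of the tags deterministic
df['tags'] = df['description_summary'].apply(lambda x: ', '.join(
sorted({d_lower[y] for y in d_lower if y in x.lower()})
))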
This produces 'tags':
0 temperature (hot), toy (ball)
1 temperature (cold), toy (glove), toy (stick)
Name: tags, dtype: object
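The substring test above would also match a keyword inside a longer word (for example "hot" inside "photo"). One possible variation, sketched here under the assumption that whole-word matching is wanted, uses the original dictionary from the question and a regular expression with word boundaries. The names `pattern` and `extract_tags` are illustrative and do not come from the question or the answer.

```python
import re

dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'Glove': 'Toy', 'Stick': 'Toy'}
d_lower = {k.lower(): v.lower() for k, v in dict_list.items()}

# Try longer keywords first so 'very cold' is preferred over 'cold'
alternatives = sorted(d_lower, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, alternatives)) + r')\b')

def extract_tags(text):
    """Return the comma-separated tags of all whole-word keyword matches."""
    found = pattern.findall(text.lower())
    # dict.fromkeys removes duplicate tags while keeping first-seen order
    return ', '.join(dict.fromkeys(d_lower[k] for k in found))

sentences = ['Long sentence with keywords ball and hot',
             'Long sentence with keywords cold, stick, and glove']
result = [extract_tags(s) for s in sentences]
print(result)  # ['toy, temperature', 'temperature, toy']
```

With the DataFrame from the question, this could be applied as `df['tags'] = df['description_summary'].apply(extract_tags)`. Tags appear in the order their keywords first occur in the sentence, which matches the desired output shown in the question.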