Replace words by checking from pandas dataframe
I have a dataframe like the one below.
ID Word Synonyms
------------------------
1 drove drive
2 office downtown
3 everyday daily
4 day daily
5 work downtown
I am reading a sentence and want to replace the words in that sentence with the synonyms defined above. Here is my code:
import nltk
import pandas as pd
import string

sdf = pd.read_excel(r'C:\synonyms.xlsx')
sd = sdf.apply(lambda x: x.astype(str).str.lower())
words = 'i drove to office everyday in my car'

#######
def tokenize(text):
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    synonym = synonyms(tokens)
    return synonym

def synonyms(words):
    for word in words:
        if(sd[sd['Word'] == word].index.tolist()):
            idx = sd[sd['Word'] == word].index.tolist()
            word = sd.loc[idx]['Synonyms'].item()
        else:
            word
    return word

print(tokenize(words))
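The code above tokenizes the input sentence. I want to get the following output:
In: i drove to office everyday in my car
Out: i drive to downtown daily in my car
But the output I get is
Out: car
If I skip the synonyms function, the output is fine and the sentence is split into individual words. I am trying to understand what I am doing wrong in the synonyms function. Also, please advise if there is a better way to solve this problem.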
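One way to read the symptom: inside the loop, `word = ...` only rebinds the loop variable and never changes the token list, and the `return word` after the loop hands back whatever `word` held last, which is the final token (`car`). Below is a minimal sketch of one possible fix, assuming the mapping is small enough to hold in a plain dict. The inline `sd` frame is a stand-in for the Excel file, and `replace_synonyms` is a hypothetical name, not something from the question.

```python
import pandas as pd

# Stand-in for the spreadsheet in the question (assumed contents).
sd = pd.DataFrame({
    "Word": ["drove", "office", "everyday", "day", "work"],
    "Synonyms": ["drive", "downtown", "daily", "daily", "downtown"],
})

# Build the lookup once instead of filtering the frame for every token.
lookup = dict(zip(sd["Word"], sd["Synonyms"]))

def replace_synonyms(tokens):
    """Return a new list where each token is replaced by its synonym, if it has one."""
    return [lookup.get(token, token) for token in tokens]

# str.split is used here only to keep the sketch free of the nltk dependency.
tokens = "i drove to office everyday in my car".split()
result = " ".join(replace_synonyms(tokens))
print(result)
```

The key difference from the original loop is that a new list is built and returned as a whole, so every token survives, not just the last one.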
I would take advantage of Pandas/NumPy indexing. Since your synonym mapping is many-to-one, you can re-index using the Word column.
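# applymap is deprecated in recent pandas; the column-wise .str methods do the same for all-string columns
sd = sd.apply(lambda col: col.str.strip().str.lower()).set_index('Word').Synonyms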
print(sd)
Word
drove drive
office downtown
everyday daily
day daily
Name: Synonyms, dtype: object
Then you can easily align a list of tokens with their respective synonyms.
words = nltk.word_tokenize(u'i drove to office everyday in my car')
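# reindex keeps tokens that have no synonym as NaN; sd[words] raises KeyError for missing labels in recent pandas
sentence = sd.reindex(words).reset_index()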
print(sentence)
Word Synonyms
0 i NaN
1 drove drive
2 to NaN
3 office downtown
4 everyday daily
5 in NaN
6 my NaN
7 car NaN
Now all that remains is to use the tokens from Synonyms, falling back to Word where there is no synonym. This can be done with
sentence = sentence.Synonyms.fillna(sentence.Word)
print(sentence.values)
[u'i' 'drive' u'to' 'downtown' 'daily' u'in' u'my' u'car']
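The steps of this answer can be put together into one self-contained sketch. It assumes an inline frame in place of the Excel file, a plain `str.split` in place of `nltk.word_tokenize`, and `reindex` for the label lookup; the names `mapping`, `aligned`, and `replaced` are illustrative only.

```python
import pandas as pd

# Stand-in for the spreadsheet in the question (assumed contents).
sdf = pd.DataFrame({
    "Word": ["drove", "office", "everyday", "day", "work"],
    "Synonyms": ["drive", "downtown", "daily", "daily", "downtown"],
})

# Series indexed by Word, holding the synonym for each word.
mapping = sdf.set_index("Word")["Synonyms"]

tokens = "i drove to office everyday in my car".split()

# Align the tokens with their synonyms; tokens without an entry become NaN.
aligned = pd.DataFrame({
    "Word": tokens,
    "Synonyms": mapping.reindex(tokens).to_numpy(),
})

# Use the synonym where present, otherwise fall back to the original token.
replaced = aligned["Synonyms"].fillna(aligned["Word"])
sentence = " ".join(replaced)
print(sentence)
```

Note that `reindex` needs the Word index to be unique, which holds for the table in the question; repeated tokens in the sentence are fine.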
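import re
import pandas as pd
sdf = pd.read_excel(r'C:\synonyms.xlsx')
rep = dict(zip(sdf.Word, sdf.Synonyms))  # convert the two columns into a dictionary
words = "i drove to office everyday in my car"
# escape each key so it is matched literally inside the regular expression
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
# look up each match (re-escaped, to match the dictionary keys) and substitute its synonym
result = pattern.sub(lambda m: rep[re.escape(m.group(0))], words)
print(result)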
Output
i drive to downtown daily in my car
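One caveat with a plain alternation pattern is that it also matches inside longer words, so a key such as `day` would turn `today` into `todaily`. A possible variant, sketched here with a hard-coded dictionary in place of the spreadsheet and a hypothetical helper name `replace_words`, wraps the alternation in `\b` word boundaries so that only whole words are replaced.

```python
import re

# Stand-in for the dictionary built from the spreadsheet (assumed contents).
rep = {
    "drove": "drive",
    "office": "downtown",
    "everyday": "daily",
    "day": "daily",
    "work": "downtown",
}

# \b on both sides restricts matches to whole words; keys are escaped to match literally.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in rep) + r")\b")

def replace_words(text):
    """Replace every whole-word occurrence of a key with its synonym."""
    return pattern.sub(lambda m: rep[m.group(0)], text)

first = replace_words("i drove to office everyday in my car")
second = replace_words("today i avoided homework")
print(first)
print(second)
```

Because the lookup uses the raw matched text, the dictionary keys stay unescaped and only the pattern needs `re.escape`.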