Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error
I have code that uses NLTK's averaged perceptron tagger to POS-tag a string:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)
The result:
[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
I then tried a for loop over each tagged token, lemmatizing it with the WordNet lemmatizer:
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))
print(lemmatizedWords)
The resulting error:
Traceback (most recent call last):
File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
lemmatizedWords = WordNetLemmatizer().lemmatize(w)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
forms = apply_rules([form])
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
for form in forms
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
I think I have two problems here: I need to convert the Treebank POS tags into a format WordNet understands (beyond some related code, I couldn't find much on this error), and I need to pass the lemmatizer something it accepts. How should I follow up POS tagging with lemmatization so that I avoid these errors?
The Python interpreter tells you exactly what the problem is:
AttributeError: 'tuple' object has no attribute 'endswith'
tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the WordNetLemmatizer class code). Only objects of type str have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
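The root cause can be reproduced without NLTK at all: _morphy eventually calls endswith() on whatever it is handed, and only the str element of each (word, tag) pair has that method:

```python
pair = ('dogs', 'NNS')  # what pos_tag() produces for each token

# A tuple has no endswith() method -- exactly the AttributeError above
assert not hasattr(pair, 'endswith')

# The word itself is a plain str, so it does
assert hasattr(pair[0], 'endswith')
assert pair[0].endswith('s')
```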
The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other NLTK corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). Here is the full script, with the get_wordnet_pos() method taken from this answer:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # fall back to noun: lemmatize() raises a KeyError on an empty POS string
        return wordnet.NOUN
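The same mapping can also be written as a dictionary lookup on the tag's first letter. This sketch hard-codes the single-character values that NLTK's wordnet constants resolve to ('a', 'v', 'n', 'r'), which is worth checking against your NLTK version; penn_to_wordnet is a hypothetical name, not part of NLTK:

```python
# Hypothetical dict-based equivalent of get_wordnet_pos().
# 'a'/'v'/'n'/'r' mirror the values of wordnet.ADJ/VERB/NOUN/ADV in NLTK.
PENN_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def penn_to_wordnet(treebank_tag, default='n'):
    # Default to noun, mirroring lemmatize()'s own default POS
    return PENN_TO_WORDNET.get(treebank_tag[:1], default)
```

For example, penn_to_wordnet('NNS') yields 'n' and penn_to_wordnet('VBZ') yields 'v', while an unmapped tag like 'IN' falls back to the noun default.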
string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))
print(lemmatizedWords)