
Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error

I have code that does POS tagging with NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

I then tried a for loop to iterate over each tagged token and lemmatize it with the WordNet lemmatizer:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))

print(lemmatizedWords)

Resulting error:

Traceback (most recent call last):

  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]

AttributeError: 'tuple' object has no attribute 'endswith'

I think I have two problems here:

  1. The POS tags are not converted to tags that WordNet understands (I tried to implement something similar to this answer: wordnet lemmatization and pos tagging in python)
  2. The data structure is not formatted so that I can iterate over each tuple (I could not find much on this error apart from os-related code)

How can I follow up POS tagging with lemmatization to avoid these errors?

The Python interpreter tells you exactly what the problem is:

AttributeError: 'tuple' object has no attribute 'endswith'

tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the WordNetLemmatizer class code here). Only objects of type string have the method endswith(), so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
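As an aside (not part of the original answer), since each element of tokensPOS is a (word, tag) tuple, tuple unpacking is an equivalent and arguably clearer alternative to indexing with w[0]; the literal list below mirrors the pos_tag() result shown above so the snippet runs on its own:

```python
# pos_tag() returns a list of (word, tag) tuples; hard-coded here
# so the snippet is self-contained.
tokensPOS = [('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

# Unpack each tuple directly instead of indexing with w[0]/w[1].
words = [word for word, tag in tokensPOS]
print(words)  # ['dogs', 'runs', 'fast']
```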

The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other nltk corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). Full script, with the get_wordnet_pos() method from this answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun (lemmatize()'s default); returning '' would
        # make lemmatize() raise a KeyError for tags such as 'DT'.
        return wordnet.NOUN

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))

print(lemmatizedWords)
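For what it's worth, the if/elif chain in get_wordnet_pos() can also be sketched as a dict lookup keyed on the tag's first letter (a common simplification, not part of the original answer). The single-letter codes below equal wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV ('a', 'v', 'n', 'r'), and unknown tags fall back to noun, which is lemmatize()'s default:

```python
# Sketch: dict-based Treebank -> WordNet tag mapping.
# The literal codes equal wordnet.ADJ/.VERB/.NOUN/.ADV respectively.
TAG_MAP = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def get_wordnet_pos(treebank_tag):
    # Unknown tags (e.g. 'DT') fall back to noun, the lemmatizer's default.
    return TAG_MAP.get(treebank_tag[:1], 'n')

print(get_wordnet_pos('NNS'))  # n
print(get_wordnet_pos('VBZ'))  # v
print(get_wordnet_pos('RB'))   # r
```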

