
Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error

I have code that does POS tagging with NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

I then tried a for loop to iterate over each tagged token and lemmatize it with the WordNet lemmatizer:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))

print(lemmatizedWords)

Resulting error:

Traceback (most recent call last):

  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]

AttributeError: 'tuple' object has no attribute 'endswith'

I think I have two problems here:

  1. The POS tags are not converted to tags that WordNet understands (I tried to implement something similar to this answer: wordnet lemmatization and pos tagging in python)
  2. The data structure is not formatted so that I can iterate over each tuple (I could not find much on this error apart from os-related code)

How can I follow up POS tagging with lemmatization to avoid these errors?

The Python interpreter tells you exactly what the problem is:

AttributeError: 'tuple' object has no attribute 'endswith'

tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the WordNetLemmatizer class code here). Only objects of type string have the method endswith(), so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
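As an aside (not part of the original answer), since each element of tokensPOS is a (word, tag) tuple, tuple unpacking is an equivalent and arguably clearer alternative to indexing with w[0]; the literal list below mirrors the pos_tag() result shown above so the snippet runs on its own:

```python
# pos_tag() returns a list of (word, tag) tuples; hard-coded here
# so the snippet is self-contained.
tokensPOS = [('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

# Unpack each tuple directly instead of indexing with w[0]/w[1].
words = [word for word, tag in tokensPOS]
print(words)  # ['dogs', 'runs', 'fast']
```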

The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other nltk corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). Full script, with the get_wordnet_pos() method from this answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun (lemmatize()'s default); returning '' would
        # make lemmatize() raise a KeyError for tags such as 'DT'.
        return wordnet.NOUN

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))

print(lemmatizedWords)
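For what it's worth, the if/elif chain in get_wordnet_pos() can also be sketched as a dict lookup keyed on the tag's first letter (a common simplification, not part of the original answer). The single-letter codes below equal wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV ('a', 'v', 'n', 'r'), and unknown tags fall back to noun, which is lemmatize()'s default:

```python
# Sketch: dict-based Treebank -> WordNet tag mapping.
# The literal codes equal wordnet.ADJ/.VERB/.NOUN/.ADV respectively.
TAG_MAP = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def get_wordnet_pos(treebank_tag):
    # Unknown tags (e.g. 'DT') fall back to noun, the lemmatizer's default.
    return TAG_MAP.get(treebank_tag[:1], 'n')

print(get_wordnet_pos('NNS'))  # n
print(get_wordnet_pos('VBZ'))  # v
print(get_wordnet_pos('RB'))   # r
```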

