Error when extracting noun phrases from a training corpus and removing stop words with NLTK

Question

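I am new to Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. I have written the code, but it still raises an error. Can anyone help me fix this? If there is a better solution, please suggest it. Thanks.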

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

docid='19509'
title='Example noun-phrase and stop words'
print('Document id:'),docid
print('Title:'),title

#list noun phrase
content='This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word,pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All Noun Phrase:'),nouns

#remove stop words
stop_words = set(stopwords.words("english"))

example_words = word_tokenize(nouns)
filtered_sentence = []

for w in example_words:
  if w not in stop_words:
     filtered_sentence.append(w)

print('Without stop words:'),filtered_sentence

I get the following error:

Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text,realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Answer 1

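This error occurs because the word_tokenize function expects a string as its argument, but it was given a list of strings. As far as I understand what you are trying to achieve, you do not need to tokenize at this point. After print('All Noun Phrase:'),nouns you already have all the nouns in the sentence. To remove the stop words, you can use: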

### remove stop words ###
stop_words = set(stopwords.words("english"))
# find the nouns that are not in the stopwords
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
# your sentence is now clear
print('Without stop words:',nouns_without_stopwords)

Of course, in this case the result is the same list of nouns, because none of the nouns is a stop word.
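One detail worth keeping in mind with this kind of filter: NLTK's English stop-word list is lowercase, so a capitalized token such as "This" would slip through a plain membership test. Below is a minimal sketch of a case-insensitive variant, written for Python 3 (the traceback above is from Python 2.7). The names STOP_WORDS and remove_stop_words are hypothetical, the small set is only a stand-in for set(stopwords.words("english")) so that the snippet runs without downloading NLTK data, and the nouns list is assumed tagger output for illustration. The last few lines also reproduce the kind of TypeError seen in the question, which arises when a list is handed to a regex-based tokenizer.

```python
import re

# Stand-in for set(stopwords.words("english")): a tiny hand-picked subset
# (an assumption, not the real NLTK list) so the sketch needs no NLTK data.
STOP_WORDS = {"this", "is", "a", "the", "off"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, compared case-insensitively."""
    if isinstance(words, str):
        # Guard against the opposite mistake: passing raw text instead of tokens.
        raise TypeError("expected a list of tokens, got a single string")
    return [w for w in words if w.lower() not in stop_words]


# Assumed output of the noun-extraction step, for illustration only.
nouns = ["sentence", "stop", "words", "filtration"]
filtered = remove_stop_words(nouns)
print("Without stop words:", filtered)

# Capitalized stop words are removed as well.
print(remove_stop_words(["This", "sample", "The", "words"]))

# A regex cannot scan a list, which is what word_tokenize(nouns) ran into.
try:
    re.compile(r"\w+").finditer(nouns)
except TypeError as exc:
    print("TypeError:", exc)
```

With the real NLTK list, only the STOP_WORDS definition would change; the filtering logic stays the same.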

I hope this helps.

Solution 1
1 Accepted 2017-04-06 10:36:21
