
Error when extracting noun-phrases from the training corpus and removing stop words using NLTK

I am new to both Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. I have already written my code but still get an error. Can anyone help me fix this problem? Or please recommend a better solution if there is one. Thank you.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

docid='19509'
title='Example noun-phrase and stop words'
print('Document id:'),docid
print('Title:'),title

#list noun phrase
content='This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word,pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All Noun Phrase:'),nouns

#remove stop words
stop_words = set(stopwords.words("english"))

example_words = word_tokenize(nouns)
filtered_sentence = []

for w in example_words:
    if w not in stop_words:
        filtered_sentence.append(w)

print('Without stop words:'),filtered_sentence

And I got the following error:

Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

You are getting this error because the function word_tokenize expects a string as its argument, and you are giving it a list of strings. As far as I understand what you are trying to achieve, you do not need to tokenize at this point. Up to print('All Noun Phrase:'),nouns, you already have all the nouns of your sentence. To remove the stopwords, you can use:

### remove stop words ###
stop_words = set(stopwords.words("english"))
# find the nouns that are not in the stopwords
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
# your sentence is now clear
print('Without stop words:',nouns_without_stopwords)

Of course, in this case you get the same result as nouns, because none of the nouns is a stopword.
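
If you actually need multi-word noun phrases rather than single nouns, one possible approach is to chunk the POS-tagged tokens with nltk.RegexpParser. Below is a minimal sketch of that idea; the chunk grammar (an optional determiner, any adjectives, one or more nouns) and the variable names are just illustrative assumptions:

import nltk
from nltk.corpus import stopwords

content = 'This is a sample sentence, showing off the stop words filtration.'
stop_words = set(stopwords.words('english'))

# POS-tag the tokens, then chunk them with a simple noun-phrase grammar
tagged = nltk.pos_tag(nltk.word_tokenize(content))
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')
tree = chunker.parse(tagged)

# collect the words of every NP chunk, dropping stopwords inside each phrase
noun_phrases = []
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    words = [word for word, pos in subtree.leaves() if word.lower() not in stop_words]
    if words:
        noun_phrases.append(' '.join(words))

print('Noun phrases without stop words:', noun_phrases)

The exact phrases you get depend on the tagger and NLTK version, but this keeps whole chunks such as "sample sentence" together instead of returning isolated nouns.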

I hope this helps.
