Remove all words that are not nouns, verbs, adjectives, adverbs, or proper names (spaCy, Python)
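I wrote the code below. I want to print the words in the first 10 sentences, with every word removed that is not a noun, verb, adjective, adverb, or proper name, but I don't know how to do that. Can anyone help?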
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer loads in spaCy 3
tokens = [[token.text for token in nlp(sentence)] for sentence in documents[:200]]
pos = [[token.pos_ for token in nlp(sentence)] for sentence in documents[:100]]
pos
You only need to know which POS tags denote these categories; the spaCy documentation lists them. This code will help you meet the requirement:
import spacy
nlp = spacy.load('en_core_web_sm')  # you can use other methods
# excluded tags
excluded_tags = {"NOUN", "VERB", "ADJ", "ADV", "ADP", "PROPN"}
document = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
sentences = document[:10]  # first 10 sentences
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ not in excluded_tags:
            new_sentence.append(token.text)
    new_sentences.append(" ".join(new_sentence))
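Now new_sentences holds the same sentences as before, but without any nouns, verbs, and so on. You can confirm this by iterating over sentences and new_sentences together to see the difference: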
for old_sen, new_sen in zip(sentences, new_sentences):
    print("Before:", old_sen)
    print("After:", new_sen)
    print()
Before: Loomings .
After: .
Before: Call me Ishmael .
After: me .
Before: Some years ago -- never mind how long precisely -- having little or no money in my purse , and nothing particular to interest me on shore , I thought I would sail about a little and see the watery part of the world .
After: Some -- -- or no my , and nothing to me , I I a and the the .
Before: It is a way I have of driving off the spleen and regulating the circulation .
After: It is a I have the and the .
...
...
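Note that the output above drops the listed categories, while the question asks to keep them and drop everything else. A minimal sketch of that inverse filter follows. The names KEEP_TAGS, keep_content_words, Tok, and demo are hypothetical, ADP is left out because the question does not mention it, and the demo tokens carry hand-assigned tags instead of real spaCy output so the sketch runs without a model.

```python
from collections import namedtuple

# Tags to keep: the categories named in the question (ADP deliberately omitted).
KEEP_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}


def keep_content_words(tokens, keep_tags=KEEP_TAGS):
    """Return the texts of tokens whose coarse POS tag is in keep_tags.

    Works on any objects exposing .text and .pos_, such as spaCy tokens.
    """
    return [token.text for token in tokens if token.pos_ in keep_tags]


# Hypothetical stand-in for spaCy tokens; the tags here are assigned by hand.
Tok = namedtuple("Tok", ["text", "pos_"])
demo = [
    Tok("Call", "VERB"),
    Tok("me", "PRON"),
    Tok("Ishmael", "PROPN"),
    Tok(".", "PUNCT"),
]

print(keep_content_words(demo))  # ['Call', 'Ishmael']
```

With a loaded model, the same function can be applied to each sentence, for example [keep_content_words(nlp(s)) for s in sentences]. spaCy tags auxiliary verbs as AUX rather than VERB, so add "AUX" to the set if words such as "is" or "would" should be kept.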