
Punctuation, stopwords and lemmatization with spaCy

I'm trying to apply punctuation removal, stopword removal and lemmatization to a list of strings.

I tried to use lemma_ , is_stop and is_punct :

data = ['We will pray and hope for the best', 
    'Though it may not make landfall all week if it follows that track',
    'Heavy rains, capable of producing life-threatening flash floods, are possible']

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

doc = list(nlp.pipe(data))

data_clean = [[w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num] for doc in data]

I get the following error: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'

(same problem for is_stop and is_punct )

In the outer loop you iterate over data , the raw list of strings, but you need to iterate over the processed Doc objects: only the Token objects inside a Doc have the lemma_ , is_stop and is_punct attributes, a plain string does not. Also, your variable names are confusing; the following naming should be clearer:

docs = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if (not w.is_stop and not w.is_punct and not w.like_num)] for doc in docs]
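To see why the comprehension has to loop over the processed docs rather than the raw strings, here is a minimal self-contained sketch. Tok is a hypothetical stand-in that mimics the spaCy Token attributes used above; it is not a spaCy class:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a spaCy Token, exposing the same attributes
# the comprehension filters on.
@dataclass
class Tok:
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False
    like_num: bool = False

# Each "doc" is a sequence of token objects, analogous to what
# list(nlp.pipe(data)) would yield.
docs = [
    [Tok("we", is_stop=True), Tok("pray"), Tok("hope")],
    [Tok("heavy"), Tok("rain"), Tok(",", is_punct=True), Tok("3", like_num=True)],
]

# Iterating over docs works because every w is a token object.
# Iterating over the raw strings instead would raise AttributeError,
# since a str has no .lemma_ attribute.
data_clean = [
    [w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num]
    for doc in docs
]
print(data_clean)  # [['pray', 'hope'], ['heavy', 'rain']]
```

Note also that the spacy.load("en") shorthand only works with older spaCy versions; in spaCy 3 you load an installed pipeline by its full name, e.g. spacy.load("en_core_web_sm").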

