
Removing noun phrases containing stop words using spaCy

I have been using spaCy to look for the most frequently used nouns and noun phrases.

I can successfully get rid of punctuation and stop words when looking for single nouns:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

docx = nlp('The bird is flying high in the sky blue of color')

# Just looking at nouns
nouns = []
for token in docx:
    if token.is_stop != True and token.is_punct != True and token.pos_ == 'NOUN':
        nouns.append(token)

# Count and look at the most frequent nouns #
word_freq = Counter(nouns)
common_nouns = word_freq.most_common(10)

Using noun_chunks to identify phrases, however, results in an AttributeError:

noun_phrases = []
for noun in docx.noun_chunks: 
    if len(noun) > 1 and '-PRON-' not in noun.lemma_ and noun.is_stop:
        noun_phrases.append(noun)

AttributeError: 'spacy.tokens.span.Span' object has no attribute 'is_stop'

I understand the nature of the message, but I can't for the life of me get the syntax right so that a noun chunk whose lemmatized string contains a stop word is excluded from the noun_phrases list.

Output without removing stop words:

[{'word': 'The bird', 'lemma': 'the bird', 'len': 2}, {'word': 'the sky blue', 'lemma': 'the sky blue', 'len': 3}]

Intended output (removing any chunk whose lemma contains a stop word, such as "the"):

[{}]
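The question doesn't show how the word/lemma/len dicts above are built; a minimal sketch of how they could be produced from noun_chunks (keeping only chunks longer than one token; the variable name is just illustrative) might look like this:

# Sketch only: build word/lemma/len dicts from the noun chunks in docx
noun_phrase_info = []
for chunk in docx.noun_chunks:
    if len(chunk) > 1:
        noun_phrase_info.append({'word': chunk.text, 'lemma': chunk.lemma_, 'len': len(chunk)})
print(noun_phrase_info)
# e.g. [{'word': 'The bird', 'lemma': 'the bird', 'len': 2}, {'word': 'the sky blue', 'lemma': 'the sky blue', 'len': 3}]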

What version of spaCy and Python are you using?

I am using Python 3.6.5 and spaCy 2.0.12 on macOS High Sierra. Your code seems to produce the intended output:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

docx = nlp('The bird is flying high in the sky blue of color')

# Just looking at nouns
nouns = []
for token in docx:
    if token.is_stop != True and token.is_punct != True and token.pos_ == 'NOUN':
        nouns.append(token)

# Count and look at the most frequent nouns #
word_freq = Counter(nouns)
common_nouns = word_freq.most_common(10)

print(word_freq)
print(common_nouns)


$python3  /tmp/nlp.py
Counter({bird: 1, sky: 1, blue: 1, color: 1})
[(bird, 1), (sky, 1), (blue, 1), (color, 1)]

Also, 'is_stop' is an attribute of the individual tokens inside docx, not of a Span. You can check which attributes an object exposes via

>>> dir(docx[0])

You may want to upgrade spacy and its dependencies and see if that helps.
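If it helps, the upgrade would typically look something like this (assuming pip and the small English model):

pip install -U spacy
python -m spacy download en_core_web_sm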

Also, 'flying' is a VERB, so even after lemmatization it will not get appended, per your condition.

token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop
flying      fly           VERB        VBG         ROOT        xxxx          True            False
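That row can be reproduced by printing the standard token attributes for docx, e.g.:

# Print the usual token attributes for every token in the doc
for token in docx:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)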

EDIT-1

You can try something like this. Since we can't use is_stop directly on noun chunks, we iterate through each token in the chunk and check the conditions from your requirements (e.g. no stop words, length > 1, etc.). If they are satisfied, we append the chunk to a list.

noun_phrases = []
for chunk in docx.noun_chunks:
    print(chunk)
    if all(not token.is_stop and not token.is_punct and '-PRON-' not in token.lemma_ for token in chunk):
        if len(chunk) > 1:
            noun_phrases.append(chunk)
print(noun_phrases)

Result:

python3 /tmp/so.py
Counter({bird: 1, sky: 1, blue: 1, color: 1})
[(bird, 1), (sky, 1), (blue, 1), (color, 1)]
The bird
the sky blue
color
[]   # contents of noun_phrases is empty here.

Hope this helps. You can tweak the conditions inside if all(...) to match your requirements.
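For example, one variant that also records word/lemma/len like your intended output (just a sketch; any() drops every chunk that contains a stop word, and the list name is only illustrative):

noun_phrases = []
for chunk in docx.noun_chunks:
    # keep only multi-token chunks with no stop words at all
    if len(chunk) > 1 and not any(token.is_stop for token in chunk):
        noun_phrases.append({'word': chunk.text, 'lemma': chunk.lemma_, 'len': len(chunk)})
print(noun_phrases)  # [] for this sentence, since every multi-token chunk contains "The"/"the"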

You may want to try the Berkeley Neural Parser as well: https://spacy.io/universe/project/self-attentive-parser I am told it gives you a Penn Treebank-style parse tree. I am also told that it is slow :-(

Also, if I am not mistaken, a noun chunk consists of tokens, and tokens come with is_stop, pos_ and tag_; i.e., you can filter accordingly.

Two frustrating issues I have found with noun chunks are how it handles noun + prepositional-phrase constructions at the right boundary, and its intermittent treatment of "and" between two noun chunks. Regarding the first issue, it will not pick up "the University of California" as one chunk, but rather "the University" and "California" as two separate noun chunks.
Furthermore, it is not consistent, which kills me. "Jim Smith and Jain Jones" can come out as "Jim Smith" plus "Jain Jones", two noun chunks, which is the right answer, or as "Jim Smith and Jain Jones" all in one noun chunk!?!
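For example (purely illustrative; the exact chunking depends on the model and version):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('He studied at the University of California with Jim Smith and Jain Jones.')
print([chunk.text for chunk in doc.noun_chunks])
# Depending on the model, "the University" and "California" may come out as separate
# chunks, and "Jim Smith and Jain Jones" may be merged into one chunk or split into two.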
