Add/remove custom stop words with spacy

Question

What is the best way to add/remove stop words with spaCy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I looked at the documentation but could not find anything regarding stop words. Thanks!

Answer 1

Using spaCy 2.0.11, you can update its stop word set in one of the following ways:

To add a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}

To remove a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}

Note: To see the current set of stopwords, use:

print(nlp.Defaults.stop_words)

Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
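
The snippets above work because nlp.Defaults.stop_words is an ordinary Python set, so the usual set methods apply. Here is a minimal sketch of those operations on a stand-in set (the name demo_stop_words and its contents are hypothetical, chosen so the example runs without a spaCy model). One detail worth knowing: remove() raises KeyError when the word is absent, while discard() and the -= operator do not.

```python
# Stand-in for nlp.Defaults.stop_words (hypothetical contents).
demo_stop_words = {"the", "whatever", "whenever"}

# Add one word, then several at once (in-place union).
demo_stop_words.add("my_new_stopword")
demo_stop_words |= {"my_new_stopword1", "my_new_stopword2"}

# remove() raises KeyError if the word is not in the set ...
try:
    demo_stop_words.remove("not_in_the_set")
    remove_failed = False
except KeyError:
    remove_failed = True

# ... while discard() silently does nothing in that case.
demo_stop_words.discard("not_in_the_set")

# Remove several words at once (in-place difference); absent words are ignored.
demo_stop_words -= {"whatever", "whenever", "not_in_the_set"}

print(sorted(demo_stop_words))
```

If you are not sure whether a word is currently a stop word, discard() or -= is the safer choice for removal.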

Answer 2

You can edit them before processing your text like this (see this post):

>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This seems to work for versions <= 1.8. For newer versions, see the other answers.

Answer 3

For version 2.0 I used this:

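import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")  # model name assumed, as in the other answers

print(STOP_WORDS)  # <- set of spaCy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")

# Flag every stop word on its lexeme in the vocabulary
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True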

The import gives you all of the default stop words as a set.

You can add your own stop words to STOP_WORDS or use your own list in the first place.

Answer 4

For 2.0 use the following:

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

Answer 5

This also collects the stop words :)

from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS

Answer 6

In the latest version, the following removes the word from the set:

from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS
spacy_stopwords.remove('not')

Answer 7

For version 2.3.0, if you want to replace the entire set instead of adding or removing a few stop words, you can do this:

import spacy

custom_stop_words = {'the', 'and', 'a'}

# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

# Now load your model
nlp = spacy.load('en_core_web_md')

The trick is to assign the stop word set for the language before loading the model. It also ensures that any upper/lower case variation of a stop word is considered a stop word.

Answer 8

See the code below:

# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

print(len(nlp.Defaults.stop_words))

# Make a list of the words you want to add to the stop words
new_stop_words = ['apple', 'ball', 'cat']

# Iterate over the list

for item in new_stop_words:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)
    
    # Set the stop_word tag on the lexeme
    nlp.vocab[item].is_stop = True

Hope this helps. You can print the length before and after to confirm.
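
One way to do that before/after check is to copy the set first and diff it afterwards, which shows exactly which words were new rather than only the change in length. This is a minimal sketch on a stand-in set: stop_words and its contents are hypothetical, and with spaCy you would presumably use nlp.Defaults.stop_words in its place.

```python
# Stand-in for nlp.Defaults.stop_words (hypothetical contents).
stop_words = {"a", "an", "the"}

# Take a copy before modifying; set() makes an independent snapshot.
before = set(stop_words)

# "the" is already present, so adding it again changes nothing.
for item in ["apple", "ball", "cat", "the"]:
    stop_words.add(item)

# Words that are in the set now but were not there before.
added = stop_words - before

print(len(before), len(stop_words))
print(sorted(added))
```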

Add/remove custom stop words with spacy

Question

8 answers

solution1
51 2018-08-01 06:49:38

solution2
45 ACCEPTED 2016-12-15 19:52:57

solution3
19 2017-09-23 13:52:03

solution4
4 2018-03-25 09:55:28

solution5
2 2019-08-23 12:10:37

solution6
0 2019-09-20 11:46:35

solution7
0 2021-03-04 21:32:49

solution8
0 2023-01-03 05:19:31
