
Remove SPECIAL stopwords for NLP

I have a text classification task. I want to classify a set of documents into 4 categories (Business, Entertainment, Health, Technology). I created a word cloud for every category (after removing stopwords), and each word cloud still contains stopwords such as april, tuesday, yesterday, hundred. I merged the stopword sets from SpaCy, NLTK and gensim into one complete set of stopwords and applied a "remove_stopwords" function, but I realized that many special stopwords remain in the text.
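
For reference, the merging and removal step looks roughly like this (a sketch; the exact "remove_stopwords" function may differ):

    # Merge the spaCy, NLTK and gensim stopword lists into one set and strip them from a document.
    from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOPWORDS
    from nltk.corpus import stopwords as nltk_stopwords  # needs nltk.download("stopwords") once
    from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS

    ALL_STOPWORDS = set(SPACY_STOPWORDS) | set(nltk_stopwords.words("english")) | set(GENSIM_STOPWORDS)

    def remove_stopwords(text):
        # Keep only tokens that are not in the merged stopword set.
        return " ".join(tok for tok in text.lower().split() if tok not in ALL_STOPWORDS)

    print(remove_stopwords("The company announced its results on Tuesday in April"))
    # words like "tuesday" and "april" survive, since they are not in any of the standard lists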

Question 1: I want to remove the following:

Location stopwords – country names, city names, etc.

Time stopwords – names of the months and days (january, february, monday, tuesday, today, tomorrow, …), etc.

Numerals stopwords – words describing numerical terms (hundred, thousand, … etc.)

Doing this by hand is a time-consuming task. Is there a better solution?

Question 2

In another text classification problem with 4 classes (business, science, sports, world), take a look, for example, at the world column. Is it good practice to use words like "monday" and "yesterday" to classify a text into the "world" category?

[image]

In NLP there's no clear definition of "stop words", let alone "special" stop words. The concept usually refers to frequent words (typically grammatical words) which do not contribute to the semantics of the text, so they can be filtered out. Since there's no strict definition, one is free to define stop words in any way they want.

At the other end of the frequency spectrum, rare words can cause more serious issues, because the classifier can mistakenly associate them with a class even though they mostly appear by chance (this is overfitting). Rare words are not usually called "stop words", but most of the examples you mention probably fall into this category: city names, months, numbers. In general rare words need to be filtered out in order to avoid overfitting, typically by specifying a minimum document frequency (e.g. with the min_df argument of CountVectorizer).
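
A minimal sketch of this, with a made-up toy corpus and threshold:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "stocks rose on tuesday after the earnings report",
        "stocks fell after the earnings warning",
        "the new phone launches in april with a faster chip",
        "the phone maker reported strong quarterly earnings",
    ]

    # min_df=2 drops any word that appears in fewer than 2 documents, so one-off
    # terms like "tuesday" or "april" never reach the classifier.
    vectorizer = CountVectorizer(min_df=2)
    X = vectorizer.fit_transform(docs)
    print(sorted(vectorizer.get_feature_names_out()))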

So in general the method is not to predefine a list containing all the possible "stop words": this would be costly and error-prone, and it defeats the purpose of ML, since most of the classifier's work would be done manually beforehand. The classifier can perfectly well take care of words which appear frequently enough by itself: if these words are not relevant for the class, it ignores them. However, it can make errors due to rare words, so these should be taken care of... and that is much easier than preparing a huge list of stop words.
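
The whole workflow then reduces to a vectorizer with a minimum-frequency threshold feeding a classifier; a sketch (the corpus, labels and min_df value below are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny made-up corpus with two of the classes, just to show the shape of the pipeline.
    texts = [
        "stocks rose after strong quarterly earnings",
        "the central bank raised interest rates",
        "the team won the championship game last night",
        "the striker scored twice in the final match",
    ]
    labels = ["business", "business", "sports", "sports"]

    model = make_pipeline(
        TfidfVectorizer(min_df=1, stop_words="english"),  # raise min_df on a real corpus to drop rare words
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    print(model.predict(["shares dropped after the earnings report"]))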
