
Remove SPECIAL stopwords for NLP

I have a text classification task. I want to classify a set of documents into 4 categories (Business, Entertainment, Health, Technology). I created a word cloud for every category (after removing stopwords), and each word cloud still contains stopwords such as april, tuesday, yesterday, hundred. I merged the stopword sets from SpaCy, NLTK and gensim into one complete set of stopwords and applied a "remove_stopwords" function, but I realized that many special stopwords remain in the text.
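
For reference, the merging and removal step looks roughly like this (a sketch; the exact "remove_stopwords" function may differ):

    # Merge the spaCy, NLTK and gensim stopword lists into one set and strip them from a document.
    from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOPWORDS
    from nltk.corpus import stopwords as nltk_stopwords  # needs nltk.download("stopwords") once
    from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS

    ALL_STOPWORDS = set(SPACY_STOPWORDS) | set(nltk_stopwords.words("english")) | set(GENSIM_STOPWORDS)

    def remove_stopwords(text):
        # Keep only tokens that are not in the merged stopword set.
        return " ".join(tok for tok in text.lower().split() if tok not in ALL_STOPWORDS)

    print(remove_stopwords("The company announced its results on Tuesday in April"))
    # words like "tuesday" and "april" survive, since they are not in any of the standard lists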

Question 1: I want to remove the following:

Location stopwords – country names, city names, etc.

Time stopwords – names of the months and days (january, february, monday, tuesday, today, tomorrow, …), etc.

Numerals stopwords – words describing numerical terms (hundred, thousand, … etc.)

Doing this by hand is a time-consuming task. Is there a better solution?

Question 2

In another text classification problem with 4 classes (business, science, sports, world), take a look, for example, at the world column. Is it good practice to use words like "monday" and "yesterday" to classify a text into the "world" category?

[image]

In NLP there's no clear definition of "stop words", let alone "special" stop words. The concept usually refers to frequent words (typically grammatical words) which do not contribute to the semantics of the text, so they can be filtered out. Since there's no strict definition, one is free to define stop words in any way they want.

At the other end of the frequency spectrum, rare words can cause more serious issues, because the classifier can mistakenly associate them with a class even though they mostly appear by chance (this is overfitting). Rare words are not usually called "stop words", but most of the examples you mention probably fall into this category: city names, months, numbers. In general rare words need to be filtered out in order to avoid overfitting, typically by specifying a minimum document frequency (e.g. with the min_df argument of CountVectorizer).
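
A minimal sketch of this, with a made-up toy corpus and threshold:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "stocks rose on tuesday after the earnings report",
        "stocks fell after the earnings warning",
        "the new phone launches in april with a faster chip",
        "the phone maker reported strong quarterly earnings",
    ]

    # min_df=2 drops any word that appears in fewer than 2 documents, so one-off
    # terms like "tuesday" or "april" never reach the classifier.
    vectorizer = CountVectorizer(min_df=2)
    X = vectorizer.fit_transform(docs)
    print(sorted(vectorizer.get_feature_names_out()))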

So in general the method is not to predefine a list containing all the possible "stop words": this would be costly and error-prone, and it defeats the purpose of ML, since most of the classifier's work would be done manually beforehand. The classifier can perfectly well take care of words which appear frequently enough by itself: if these words are not relevant for the class, it ignores them. However, it can make errors due to rare words, so these should be taken care of... and that is much easier than preparing a huge list of stop words.
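
The whole workflow then reduces to a vectorizer with a minimum-frequency threshold feeding a classifier; a sketch (the corpus, labels and min_df value below are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny made-up corpus with two of the classes, just to show the shape of the pipeline.
    texts = [
        "stocks rose after strong quarterly earnings",
        "the central bank raised interest rates",
        "the team won the championship game last night",
        "the striker scored twice in the final match",
    ]
    labels = ["business", "business", "sports", "sports"]

    model = make_pipeline(
        TfidfVectorizer(min_df=1, stop_words="english"),  # raise min_df on a real corpus to drop rare words
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    print(model.predict(["shares dropped after the earnings report"]))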
