
Text preprocessing for text classification using fastText

What text preprocessing produces the best results for supervised text classification using fastText?

The official documentation shows only a simple preprocessing consisting of lower-casing and separating punctuation. Would classic preprocessing like lemmatization, stopword removal, or masking numbers help?
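A minimal Python sketch of that kind of preprocessing, roughly equivalent to the sed/tr pipeline in the fastText tutorial (the exact punctuation set is an assumption):

```python
import re

def simple_preprocess(text: str) -> str:
    """Lower-case and put spaces around punctuation so each mark
    becomes its own token, as in the fastText tutorial pipeline."""
    text = text.lower()
    # the punctuation set here is an assumption; adjust for your data
    return re.sub(r"([.!?,'/()])", r" \1 ", text)

print(simple_preprocess("Why not put knives in the dishwasher?"))
# -> "why not put knives in the dishwasher ? "
```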

There is no general answer. It very much depends on the task you are trying to solve, how much data you have, and what language the text is in. Usually, if you have enough data, the simple tokenization you described is all you need.
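To make this concrete, a minimal training sketch using the official fasttext Python bindings; the file name train.txt and the hyperparameter values are placeholders, and the file is assumed to already be in fastText's `__label__` format:

```python
import fasttext

# each line of train.txt looks like: "__label__baking how do i knead dough?"
# the hyperparameters below are illustrative, not recommendations
model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=10, wordNgrams=2)

# returns the predicted label(s) and their probabilities
print(model.predict("how long should i bake the bread"))
```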

Lemmatization: Because fastText computes word embeddings from embeddings of character n-grams, it should cover most morphology in most (at least European) languages, provided your data is not very small. Only in that case might lemmatization help.
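If your data is that small, a lemmatization sketch using NLTK's WordNetLemmatizer (English-only; the whitespace tokenization and sample sentence are just for illustration):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time resource download

lemmatizer = WordNetLemmatizer()
tokens = "the cats were chasing mice".split()
# the default part of speech is noun, so only nominal morphology is folded here
print([lemmatizer.lemmatize(t) for t in tokens])
# -> ['the', 'cat', 'were', 'chasing', 'mouse']
```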

Removing stopwords: It depends on the task. If the task is based on grammar/syntax, you definitely should not remove the stopwords, because they carry the grammar. If the task depends more on lexical semantics, removing stopwords may help. And if your training data is large enough, the model should learn uninformative embeddings for the stopwords anyway, so they will not influence the classification.
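As an illustration of the risk, a sketch using NLTK's English stopword list; note that it contains negations such as "not", so removing stopwords blindly can destroy the signal for sentiment-like tasks:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))

tokens = "this is not what i ordered".split()
print([t for t in tokens if t not in stop])
# -> ['ordered']  ("not" is gone, which flips the meaning for sentiment)
```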

Masking numbers: If you are sure that your task does not benefit from knowing the numbers, you can mask them out. The usual problem is that individual numbers appear too infrequently in the training data to learn appropriate embeddings for them. This matters less in fastText, which composes their embeddings from embeddings of their substrings; in the end they will probably be uninformative and not influence the classification.
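If you do decide to mask, a simple regex sketch (the <num> placeholder and the digit pattern are assumptions; pick a placeholder that survives your tokenizer):

```python
import re

def mask_numbers(text: str, token: str = "<num>") -> str:
    """Replace digit runs (with an optional decimal part) by a placeholder."""
    return re.sub(r"\d+(?:[.,]\d+)*", token, text)

print(mask_numbers("invoice 12345: 19.99 euros, due 2021-06-03"))
# -> "invoice <num>: <num> euros, due <num>-<num>-<num>"
```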
