
Text preprocessing for text classification using fastText

What text preprocessing produces the best results for supervised text classification using fastText?

The official documentation shows only a simple preprocessing consisting of lower-casing and separating punctuation. Would classic preprocessing like lemmatization, stopword removal, or masking numbers help?
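A minimal Python sketch of that kind of preprocessing, roughly equivalent to the sed/tr pipeline in the fastText tutorial (the exact punctuation set is an assumption):

```python
import re

def simple_preprocess(text: str) -> str:
    """Lower-case and put spaces around punctuation so each mark
    becomes its own token, as in the fastText tutorial pipeline."""
    text = text.lower()
    # the punctuation set here is an assumption; adjust for your data
    return re.sub(r"([.!?,'/()])", r" \1 ", text)

print(simple_preprocess("Why not put knives in the dishwasher?"))
# -> "why not put knives in the dishwasher ? "
```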

There is no general answer. It very much depends on the task you are trying to solve, how much data you have, and what language the text is in. Usually, if you have enough data, the simple tokenization you described is all you need.
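To make this concrete, a minimal training sketch using the official fasttext Python bindings; the file name train.txt and the hyperparameter values are placeholders, and the file is assumed to already be in fastText's `__label__` format:

```python
import fasttext

# each line of train.txt looks like: "__label__baking how do i knead dough?"
# the hyperparameters below are illustrative, not recommendations
model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=10, wordNgrams=2)

# returns the predicted label(s) and their probabilities
print(model.predict("how long should i bake the bread"))
```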

Lemmatization: Because fastText computes word embeddings from embeddings of character n-grams, it should cover most morphology in most (at least European) languages, provided your data is not very small. Only in that case might lemmatization help.
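If your data is that small, a lemmatization sketch using NLTK's WordNetLemmatizer (English-only; the whitespace tokenization and sample sentence are just for illustration):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time resource download

lemmatizer = WordNetLemmatizer()
tokens = "the cats were chasing mice".split()
# the default part of speech is noun, so only nominal morphology is folded here
print([lemmatizer.lemmatize(t) for t in tokens])
# -> ['the', 'cat', 'were', 'chasing', 'mouse']
```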

Removing stopwords: It depends on the task. If the task is based on grammar/syntax, you definitely should not remove the stopwords, because they carry the grammar. If the task depends more on lexical semantics, removing stopwords may help. And if your training data is large enough, the model should learn uninformative embeddings for the stopwords anyway, so they will not influence the classification.
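As an illustration of the risk, a sketch using NLTK's English stopword list; note that it contains negations such as "not", so removing stopwords blindly can destroy the signal for sentiment-like tasks:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))

tokens = "this is not what i ordered".split()
print([t for t in tokens if t not in stop])
# -> ['ordered']  ("not" is gone, which flips the meaning for sentiment)
```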

Masking numbers: If you are sure that your task does not benefit from knowing the numbers, you can mask them out. The usual problem is that individual numbers appear too infrequently in the training data to learn appropriate embeddings for them. This matters less in fastText, which composes their embeddings from embeddings of their substrings; in the end they will probably be uninformative and not influence the classification.
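If you do decide to mask, a simple regex sketch (the <num> placeholder and the digit pattern are assumptions; pick a placeholder that survives your tokenizer):

```python
import re

def mask_numbers(text: str, token: str = "<num>") -> str:
    """Replace digit runs (with an optional decimal part) by a placeholder."""
    return re.sub(r"\d+(?:[.,]\d+)*", token, text)

print(mask_numbers("invoice 12345: 19.99 euros, due 2021-06-03"))
# -> "invoice <num>: <num> euros, due <num>-<num>-<num>"
```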
