
Text preprocessing for text classification using fastText

What text preprocessing produces the best results for supervised text classification using fastText?

The official documentation shows only a simple preprocessing consisting of lower-casing and separating punctuation. Would classic preprocessing such as lemmatization, stopword removal, or masking numbers help?
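For reference, the kind of simple normalization the documentation describes can be sketched in Python as follows (a minimal sketch, not the official script; `simple_preprocess` is a name chosen here for illustration):

```python
import re

def simple_preprocess(text):
    """Lower-case the text and put spaces around punctuation so that
    each punctuation mark becomes its own token, similar in spirit to
    the normalization shown in the fastText tutorial."""
    text = text.lower()
    # Insert spaces around common punctuation characters.
    text = re.sub(r"([.!?,'/()])", r" \1 ", text)
    # Collapse repeated whitespace into single spaces.
    return " ".join(text.split())

print(simple_preprocess("Hello, world!"))
```

Each token produced this way (including the punctuation tokens) is then fed to fastText as-is.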

There is no general answer. It very much depends on what task you are trying to solve, how much data you have, and what language the text is in. Usually, if you have enough data, the simple tokenization you describe is all you need.

Lemmatization: FastText computes word embeddings from embeddings of character n-grams, which should cover most morphology in most (at least European) languages, provided your data is not very small. If it is very small, lemmatization might help.
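To make the idea concrete, lemmatization just maps inflected surface forms to a base form before training. The toy lookup table below is made up purely for illustration; a real pipeline would use a library such as spaCy or NLTK instead:

```python
# Toy lemma table -- illustrative only; real lemmatizers use
# morphological analysis, not a hand-written dictionary.
LEMMAS = {"cats": "cat", "running": "run", "ran": "run"}

def lemmatize(tokens):
    """Replace each token with its base form when one is known."""
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize("the cats were running".split()))
```

With very little training data, collapsing "cats", "running", and "ran" onto fewer base forms gives each form more training examples.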

Removing stopwords: It depends on the task. If the task is based on grammar/syntax, you definitely should not remove the stopwords, because they form the grammar. If the task depends more on lexical semantics, removing stopwords should help. If your training data is large enough, the model should learn non-informative stopword embeddings that do not influence the classification.
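A stopword filter is a one-line set lookup. The stopword list below is a tiny illustrative sample, not a curated one (NLTK and spaCy ship full lists per language):

```python
# Illustrative stopword set -- in practice use a curated list,
# e.g. nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the price of the house is high".split()))
```

Note that this discards exactly the function words a syntax-oriented task would need, which is why the choice is task-dependent.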

Masking numbers: If you are sure your task does not benefit from knowing the numbers, you can mask them out. Usually the problem is that numbers do not appear frequently in the training data, so you never learn appropriate weights/embeddings for them. This is less of an issue in FastText, which composes their embeddings from embeddings of their substrings; they will probably end up uninformative and not influence the classification.
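Masking numbers is typically a single regex substitution that replaces every digit run with a placeholder token; the `<num>` token name here is a common convention, not anything fastText requires:

```python
import re

def mask_numbers(text, token="<num>"):
    """Replace every run of digits with a single placeholder token,
    so rare literal numbers collapse into one frequent symbol."""
    return re.sub(r"\d+", token, text)

print(mask_numbers("order 1234 shipped on 2020-01-05"))
```

All numeric tokens then share one embedding, which is well-trained because the placeholder is frequent.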

