
SVM implementation for document classification in C++

Original post 2020-02-26 18:29:03 5 1 c++/ text/ svm

I would like to implement a small project in C++ that uses an SVM to classify a set of documents (file.txt) into a number of categories, and then classifies new documents with the trained model.

I have searched widely, but I still do not fully understand what I need to do. I have heard about the LIBLINEAR library, but I do not know how to use it. If I use TF-IDF, do I need one vector for each class, or one vector for all classes? And how do I test a new document using TF-IDF? I am really confused!

1 answer

Is it a requirement that it be written in C++? Python offers many helpful resources and ready-to-use modules for machine learning tasks such as SVM implementation and usage.

scikit-learn, for instance, has helpful resources on this topic, such as this tutorial: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

As for your question: for a TF-IDF implementation you will need a vector for every document, not one per class. For every document, the words in it are listed and each is assigned a value (its TF-IDF value).
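If C++ is a requirement, the one-vector-per-document idea could be sketched roughly as follows, using only the standard library. Everything named here (`tokenize`, `TfidfModel`, `to_liblinear_line`) is hypothetical and not part of LIBLINEAR, and the smoothed IDF formula, raw term counts, and L2 normalization are assumptions rather than the only valid choices. The point it illustrates is that the vocabulary and IDF weights are learned from the training documents only, and a new document is turned into a vector with those same weights before it is classified.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Sparse vector: 1-based feature index -> weight, kept in ascending index order.
using SparseVector = std::map<int, double>;

// Split text into lowercase alphanumeric tokens (a deliberately simple tokenizer).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Hypothetical helper type, not a LIBLINEAR API.
struct TfidfModel {
    std::unordered_map<std::string, int> vocabulary;  // term -> 1-based index
    std::vector<double> idf;                          // idf[index - 1]

    // Learn the vocabulary and IDF weights from the training documents only.
    void fit(const std::vector<std::string>& documents) {
        assert(!documents.empty());
        vocabulary.clear();
        std::vector<int> doc_freq;  // number of documents containing each term
        for (const std::string& doc : documents) {
            std::set<int> seen;  // terms already counted for this document
            for (const std::string& token : tokenize(doc)) {
                auto it = vocabulary.find(token);
                if (it == vocabulary.end()) {
                    it = vocabulary.emplace(token, static_cast<int>(vocabulary.size()) + 1).first;
                    doc_freq.push_back(0);
                }
                if (seen.insert(it->second).second) ++doc_freq[it->second - 1];
            }
        }
        const double n = static_cast<double>(documents.size());
        idf.assign(doc_freq.size(), 0.0);
        for (std::size_t i = 0; i < doc_freq.size(); ++i) {
            // Assumed smoothed IDF; other variants are equally valid.
            idf[i] = std::log((1.0 + n) / (1.0 + doc_freq[i])) + 1.0;
        }
    }

    // Build one L2-normalized TF-IDF vector for a training or new document.
    // Terms that were not seen during fit() are ignored.
    SparseVector transform(const std::string& document) const {
        SparseVector vec;
        for (const std::string& token : tokenize(document)) {
            auto it = vocabulary.find(token);
            if (it != vocabulary.end()) vec[it->second] += 1.0;  // raw term count
        }
        double squared_norm = 0.0;
        for (auto& entry : vec) {
            entry.second *= idf[entry.first - 1];
            squared_norm += entry.second * entry.second;
        }
        if (squared_norm > 0.0) {
            const double norm = std::sqrt(squared_norm);
            for (auto& entry : vec) entry.second /= norm;
        }
        return vec;
    }
};

// Format one labeled document as "label index:value ...", the sparse text
// format read by LIBSVM/LIBLINEAR (indices ascending, starting at 1).
std::string to_liblinear_line(int label, const SparseVector& vec) {
    std::ostringstream out;
    out << label;
    for (const auto& entry : vec) out << ' ' << entry.first << ':' << entry.second;
    return out.str();
}
```

In this sketch you would call `fit` once on all training documents, write one `to_liblinear_line` per training document with its category as the integer label, and give that file to LIBLINEAR's `train` tool. A new document goes through `transform` with the same fitted model and is then passed to `predict`. The classes never get vectors of their own: each class is only the label attached to its documents, and the library handles the multi-class part.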