How do I classify documents with scikit-learn using TfidfVectorizer?

Question

The following example shows how one can vectorize the scikit-learn 20 newsgroups data before training a classifier.

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)

However, I have my own labeled corpus that I would like to use.

After getting the tf-idf vectors for my own data, would I train a classifier like this?

classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)

To recap: How can I use my own corpus instead of the 20 newsgroups data, but in the same way as it is used here? How can I then use my tf-idf-vectorized corpus to train a classifier?

Thanks!

Answer 1

To address the questions from the comments, the basic process for working with a tf-idf representation in a classification task is:

You fit the vectorizer to your training data and save it in some variable; let's call it tfidf.
You transform the training data (without labels, just the text) through data = tfidf.transform(...).
You fit the model (classifier) using some_classifier.fit(data, labels), where labels are in the same order as the documents in data.
During testing you use tfidf.transform(...) on the new data and check the predictions of your model.
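
The four steps above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the tiny corpus, the label names, and the choice of MultinomialNB as some_classifier are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled corpus: texts and labels are in the same order.
train_texts = [
    "the rocket reached orbit after launch",
    "nasa plans a new mission to the moon",
    "the gpu renders the image in real time",
    "shading and texture mapping for 3d graphics",
]
train_labels = ["space", "space", "graphics", "graphics"]

# 1. Fit the vectorizer on the training text only.
tfidf = TfidfVectorizer()
tfidf.fit(train_texts)

# 2. Transform the training text into a sparse tf-idf matrix.
data = tfidf.transform(train_texts)

# 3. Fit the classifier on the matrix and the matching labels.
#    MultinomialNB is an assumed choice; any scikit-learn classifier works here.
some_classifier = MultinomialNB()
some_classifier.fit(data, train_labels)

# 4. At test time, reuse the fitted vectorizer: transform only, never refit.
test_texts = ["the moon mission launch", "texture mapping on the gpu"]
predictions = some_classifier.predict(tfidf.transform(test_texts))
print(predictions)
```

Note that the classifier is given the matrix returned by the vectorizer, not the vectorizer object itself, and that scikit-learn classifiers are trained with fit rather than NLTK's train.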

Answer 2

In general, for sklearn the flow is:

Convert your string data to numeric values using a vectorizer, e.g. TfidfVectorizer or CountVectorizer.
Fit the vectorizer and transform the data.
Pass the result to the train/fit method of your chosen classifier.

You did not mention your data format, but if it is a CSV file with some rows, the flow could be:

Read each row of text.
Preprocess it, e.g. remove stop words.
raw_data_list = [row1, row2, rown...]
vectorizer = TfidfVectorizer()
x_transformed = vectorizer.fit_transform(raw_data_list)
x_transformed can be passed to the fit/train function of a classifier.

Once you have a trained classifier, you can call predict on new data. Remember to convert the new data to the same format as the training data, using the same fitted vectorizer, before passing it to classif.predict.
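
A rough sketch of this CSV flow follows. The column names text and label, the sample rows, and the choice of LogisticRegression are assumptions made for illustration; io.StringIO stands in for a real file so the example is self-contained. Stop-word removal is delegated to the vectorizer's stop_words parameter rather than done by hand.

```python
import csv
import io

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical CSV content; with a real file, use open("corpus.csv", newline="").
csv_file = io.StringIO(
    "text,label\n"
    "win a free prize now,spam\n"
    "cheap pills special offer,spam\n"
    "meeting moved to monday morning,ham\n"
    "please review the attached report,ham\n"
)

# Read each row, keeping texts and labels in the same order.
raw_data_list, labels = [], []
for row in csv.DictReader(csv_file):
    raw_data_list.append(row["text"])
    labels.append(row["label"])

# Fit the vectorizer on the training text; English stop words are dropped here.
vectorizer = TfidfVectorizer(stop_words="english")
x_transformed = vectorizer.fit_transform(raw_data_list)

# Train the classifier (an assumed choice) on the tf-idf matrix.
classif = LogisticRegression()
classif.fit(x_transformed, labels)

# New data goes through transform (not fit_transform) of the same vectorizer.
new_data = ["free prize offer", "monday report meeting"]
x_new = vectorizer.transform(new_data)
predictions = classif.predict(x_new)
print(predictions)
```

Calling transform on the fitted vectorizer keeps the new matrix in the same feature space (same columns) as the training matrix, which is what classif.predict expects.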

2 answers

Answer 1: 8, accepted, 2013-10-30 07:53:37
Answer 2: 2, 2013-10-30 04:14:15
