How do I classify documents with scikit-learn using TfidfVectorizer?
The following example shows how one can train a classifier with the sklearn 20 newsgroups data.
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)
However, I have my own labeled corpus that I would like to use.
After computing TF-IDF vectors for my own data, would I train a classifier like this?
classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)
To recap: How can I use my own corpus instead of the 20 newsgroups data, but in the same way as shown here? How can I then use my TF-IDF-vectorized corpus to train a classifier?
Thanks!
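To address the questions from the comments: the basic process for using a TF-IDF representation in a classification task is to fit the vectorizer on your training texts, train a classifier on the resulting vectors together with their labels, and reuse the same fitted vectorizer on any new data.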
In general, for sklearn the flow is:
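A minimal sketch of that flow, assuming the corpus is already in memory as a list of strings plus a parallel list of labels. The toy documents, the label names, and the choice of MultinomialNB are illustrative assumptions, not something the question specifies; any scikit-learn classifier that accepts sparse input could be swapped in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: one string per document, one label per document.
texts = [
    "the rocket reached orbit after launch",
    "nasa plans a new mission to the moon",
    "the gpu renders polygons and textures",
    "shaders improve image rendering quality",
]
labels = ["space", "space", "graphics", "graphics"]

# Learn the vocabulary and IDF weights, then build the document-term matrix.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

# Train on the matrix and the labels, not on the vectorizer object itself.
# The labels must be in the same order as the documents.
classif = MultinomialNB()
classif.fit(vectors, labels)
```

Note that the classifier receives the matrix returned by fit_transform and the labels. This differs from the nltk.NaiveBayesClassifier.train call in the question, which expects a list of (feature dict, label) pairs rather than a vectorizer.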
You did not mention your data format, but if it is a CSV file with one row per document, the flow could be:
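One way this could look, as a sketch only: the file name corpus.csv and the column names "text" and "label" are assumptions about your data, and the script writes a tiny stand-in file first so that it runs as-is. Replace that part with the path to your own CSV.

```python
import csv
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical rows standing in for your own labeled corpus.
rows = [
    ("the rocket reached orbit after launch", "space"),
    ("nasa plans a new mission to the moon", "space"),
    ("the gpu renders polygons and textures", "graphics"),
    ("shaders improve image rendering quality", "graphics"),
]

texts, labels = [], []
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "corpus.csv")  # assumed file name

    # Write the stand-in CSV with a header row: text,label
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(rows)

    # Read documents and labels into two parallel lists.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row["text"])
            labels.append(row["label"])

# Same flow as before: vectorize the texts, then fit the classifier.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
classif = MultinomialNB().fit(vectors, labels)
```

The only part specific to CSV is the reading loop. Once the texts and labels are in two lists of equal length, the vectorizer and classifier steps are unchanged.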
Once you have a trained classifier, you can call predict on new data. Remember to convert the new data to the same format as the training data by passing it through the same fitted vectorizer before handing it to classif.predict.
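A short sketch of the prediction step, again with a hypothetical toy corpus and MultinomialNB as an assumed classifier. The important detail is that new documents go through transform, not fit_transform, so they are mapped onto the vocabulary and IDF weights learned from the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data, as in the earlier flow.
train_texts = [
    "the rocket reached orbit after launch",
    "nasa plans a new mission to the moon",
    "the gpu renders polygons and textures",
    "shaders improve image rendering quality",
]
train_labels = ["space", "space", "graphics", "graphics"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_texts)
classif = MultinomialNB().fit(train_vectors, train_labels)

# New, unseen documents to classify.
new_texts = ["a satellite launch into orbit"]

# Reuse the fitted vectorizer: transform only, so the columns line up
# with the features the classifier was trained on.
new_vectors = vectorizer.transform(new_texts)
predicted = classif.predict(new_vectors)
print(predicted)
```

Calling fit_transform on the new data instead would build a different vocabulary, and the resulting matrix would no longer match what the classifier expects.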