How do I classify documents with scikit-learn using TfidfVectorizer?

The following example shows how one can vectorize the sklearn 20 newsgroups data in preparation for training a classifier.

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)

However, I have my own labeled corpus that I would like to use.

After getting a tf-idf vector of my own data, would I train a classifier like this?

classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)

To recap: How can I use my own corpus instead of 20 newsgroups, but in the same way as shown here? And how can I then use my tf-idf-vectorized corpus to train a classifier?

Thanks!

To address the questions from the comments, the basic process for working with a tf-idf representation in a classification task is:

  1. Fit the vectorizer to your training data and save it in some variable; let's call it tfidf.
  2. Transform the training data (without labels, just text) with data = tfidf.transform(...).
  3. Fit the model (classifier) with some_classifier.fit(data, labels), where labels are in the same order as the documents in data.
  4. During testing, call tfidf.transform(...) on the new data and check your model's predictions.
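The four steps above can be sketched as follows. This is only a minimal illustration: the tiny corpus, its labels, and the choice of MultinomialNB as some_classifier are assumptions made for the example, not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled corpus: texts and labels are in the same order.
train_texts = [
    "the rocket launched into orbit",
    "nasa plans a new moon mission",
    "the gpu renders 3d images quickly",
    "shaders and textures for 3d graphics",
]
train_labels = ["space", "space", "graphics", "graphics"]

# Step 1: fit the vectorizer on the training text only.
tfidf = TfidfVectorizer()
tfidf.fit(train_texts)

# Step 2: transform the training text into a sparse tf-idf matrix.
data = tfidf.transform(train_texts)

# Step 3: fit the classifier on the vectors and their labels.
some_classifier = MultinomialNB()
some_classifier.fit(data, train_labels)

# Step 4: transform unseen text with the SAME fitted vectorizer, then predict.
test_texts = ["a mission to orbit the moon"]
test_data = tfidf.transform(test_texts)
predicted = some_classifier.predict(test_data)
print(predicted)
```

Steps 1 and 2 can also be merged into a single tfidf.fit_transform(train_texts) call, which is what the 20 newsgroups snippet in the question does.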

In general, the flow for sklearn is:

  1. Convert your string data to numeric values using some vectorizer, e.g. TfidfVectorizer, CountVectorizer, etc.
  2. Fit and transform.
  3. Pass the result to the train/fit method of your chosen classifier.

You did not mention your data format, but if it is a CSV file with some rows, the flow could be:

  1. Read each row of text.
  2. Preprocess it, e.g. remove stop words.
  3. raw_data_list = [row1, row2, ..., rown]
  4. vectorizer = TfidfVectorizer()
  5. x_transformed = vectorizer.fit_transform(raw_data_list)
  6. x_transformed can then be passed to the fit/train function of a classifier.
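A rough sketch of that CSV flow is shown below. The in-memory CSV, the "text" and "label" column names, and the use of MultinomialNB are all assumptions for illustration; with a real file you would open it with open(...) instead of io.StringIO. Stop-word removal is delegated here to TfidfVectorizer's stop_words parameter rather than done by hand.

```python
import csv
import io

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-in for a CSV file on disk; column names are hypothetical.
csv_file = io.StringIO(
    "text,label\n"
    "the rocket launched into orbit,space\n"
    "nasa plans a new moon mission,space\n"
    "the gpu renders 3d images quickly,graphics\n"
    "shaders and textures for 3d graphics,graphics\n"
)

# Steps 1 and 3: read each row and collect texts and labels in the same order.
raw_data_list = []
labels = []
for row in csv.DictReader(csv_file):
    raw_data_list.append(row["text"])
    labels.append(row["label"])

# Steps 2, 4 and 5: drop English stop words and build the tf-idf matrix.
vectorizer = TfidfVectorizer(stop_words="english")
x_transformed = vectorizer.fit_transform(raw_data_list)

# Step 6: pass the matrix and the labels to a classifier's fit method.
classif = MultinomialNB()
classif.fit(x_transformed, labels)
```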

Once you have a trained classifier, you can call predict on new data. Remember to convert the new data to the same format as the training data, using the same fitted vectorizer as above, before passing it to classif.predict.
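The snippet below illustrates why the fitted vectorizer has to be reused. It is a sketch on a made-up corpus with MultinomialNB as an assumed classifier: transforming new text with the fitted vectorizer keeps the feature columns aligned with training, whereas fitting a fresh vectorizer on the new text produces a matrix with a different vocabulary and column count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data.
train_texts = [
    "the rocket launched into orbit",
    "nasa plans a new moon mission",
    "the gpu renders 3d images quickly",
    "shaders and textures for 3d graphics",
]
train_labels = ["space", "space", "graphics", "graphics"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_texts)
classif = MultinomialNB().fit(train_vectors, train_labels)

new_texts = ["textures for a 3d scene"]

# Correct: reuse the fitted vectorizer, so columns match the training matrix.
new_vectors = vectorizer.transform(new_texts)
prediction = classif.predict(new_vectors)

# Incorrect: a vectorizer refit on the new text learns a different vocabulary,
# so its output no longer lines up with what the classifier was trained on.
refit_vectors = TfidfVectorizer().fit_transform(new_texts)

print(new_vectors.shape[1] == train_vectors.shape[1])
print(refit_vectors.shape[1] == train_vectors.shape[1])
```

Words the fitted vectorizer never saw during training (such as "scene" here) are simply ignored by transform.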

