How to feed the output of TfidfVectorizer to the LinearSVC classifier in Sklearn?

I'm trying to build a linear classifier using LinearSVC in scikit-learn. I decided to use tf-idf vectorization to vectorize the text input. The code I wrote is:

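from sklearn.feature_extraction.text import TfidfVectorizer
review_corpus = list(train_data_df['text'])
vectorizer = TfidfVectorizer(max_df=0.9, stop_words='english')
# %timeit does not keep the assignment, so call fit_transform directly
tfidf_matrix = vectorizer.fit_transform(review_corpus)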

I now want to train an SVM model using this tfidf_matrix and use it to predict the class/label for the corresponding test set, test_data_df['text']. The problem(s) I'm having:

  1. Is it correct to use only the training data to build the TfidfVectorizer, or should I use both the training and testing text data to build the vectorizer?
  2. The main issue is: How do I get the matrix representation for the testing data? Currently, I'm not sure how to get the tf-idf scores from the vectorizer for the different documents in the test set. What I tried was to loop through the pandas Series test_data_df['text'] and then do:

     tfidf_matrix.todense(list(text))

for each text in the Series, putting the result into a list and finally making a NumPy array out of it, but I get a MemoryError.

  1. You should use only the training data to build the TfidfVectorizer(). This ensures that you are not leaking any information about the test data during the training process.

  2. Use

     tfidf_matrix_test = vectorizer.transform(test_data_df['text']) 

Now you can feed tfidf_matrix_test to the classifier.
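The two points above can be put together in a minimal end-to-end sketch. The toy lists below are hypothetical stand-ins for train_data_df['text'], the training labels (the question never names a label column), and test_data_df['text']; LinearSVC is left at its default settings, which may not suit real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the question's DataFrame columns.
train_texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "loved the friendly atmosphere",
    "awful experience, rude staff",
]
train_labels = [1, 0, 1, 0]  # assumed labels, not from the original post
test_texts = [
    "friendly staff and great atmosphere",
    "cold food and rude service",
]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer(max_df=0.9, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer on the test text; transform() does not refit,
# so the test matrix has the same columns as the training matrix.
tfidf_matrix_test = vectorizer.transform(test_texts)

# LinearSVC accepts the sparse matrices directly.
clf = LinearSVC()
clf.fit(tfidf_matrix, train_labels)
predictions = clf.predict(tfidf_matrix_test)
print(predictions)
```

Words that appear only in the test text are ignored by transform(), because the vocabulary was fixed when the vectorizer was fitted on the training data.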

PS:

Try to avoid casting the sparse matrix output of the vectorizer to a list or dense array. Doing so is memory intensive, and the classifier will also take more computation time during training and prediction.
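To illustrate why the dense conversion is costly, here is a small sketch comparing the number of values the sparse matrix actually stores with the number of cells a dense array of the same shape would allocate. The three-document corpus is a hypothetical example; on a real review corpus the gap grows with the number of documents times the vocabulary size.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to inspect the matrix layout.
docs = ["great food", "cold food", "rude staff"]
X = TfidfVectorizer().fit_transform(docs)

# The vectorizer returns a SciPy sparse matrix, not a NumPy array.
print(sparse.issparse(X))

n_rows, n_cols = X.shape
stored_values = X.nnz            # nonzero entries the sparse matrix keeps
dense_cells = n_rows * n_cols    # cells a dense array would have to allocate
print(stored_values, dense_cells)
```

Each document contains only a handful of the vocabulary's words, so most cells of the dense version would be zeros that the sparse format never stores.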
