How to feed the output of TfidfVectorizer to a LinearSVC classifier in Sklearn?

Question

I'm trying to build a linear classifier using LinearSVC in scikit-learn. I decided to use tf-idf vectorization to vectorize the text input. The code I wrote is:

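from sklearn.feature_extraction.text import TfidfVectorizer

review_corpus = list(train_data_df['text'])
vectorizer = TfidfVectorizer(max_df=0.9, stop_words='english')
# Called directly: an assignment made inside IPython's %timeit is not kept afterwards
tfidf_matrix = vectorizer.fit_transform(review_corpus)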

I now want to train an SVM model using this tfidf_matrix and use it to predict the class/label for the corresponding test set, test_data_df['text']. The problems I'm having:

Is it correct to use only the training data to build the TfidfVectorizer, or should I use both the training and testing text data to build the vectorizer?
The main issue is: how do I get the matrix representation for the testing data? Currently, I'm not sure how to get the tf-idf scores from the vectorizer for the different documents in the test set. What I tried was to loop through the Pandas Series test_data_df['text'] and then do:
```
 tfidf_matrix.todense(list(text))
```

for each text in the Series, put the result into a list, and finally make a numpy array out of it, but I get a MemoryError.

Answer 1

You should use only the training data to build the TfidfVectorizer(). This ensures that you do not leak any information about the test data during the training process.
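
To make the leakage point concrete, here is a minimal sketch that compares a vectorizer fitted on the training text only with one fitted on training plus test text. The toy sentences and the names train_texts, test_texts, train_only and leaky are made up for illustration; they stand in for train_data_df['text'] and test_data_df['text'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real train/test text columns.
train_texts = ["great food and friendly staff", "terrible service and cold food"]
test_texts = ["friendly staff but unknown cuisine"]

# Vocabulary and IDF weights learned from the training text only.
train_only = TfidfVectorizer().fit(train_texts)

# Fitting on train + test lets test-set words shape the features.
leaky = TfidfVectorizer().fit(train_texts + test_texts)

# Words that occur only in the test text show up in the leaky vocabulary.
leaked_words = sorted(set(leaky.vocabulary_) - set(train_only.vocabulary_))
print(leaked_words)
```

The IDF weights of shared words also change when the test documents are included, so the training features themselves end up depending on the test set.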

Use:

 tfidf_matrix_test = vectorizer.transform(test_data_df['text'])

Now you can feed tfidf_matrix_test to the classifier.
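
Putting the pieces together, the whole flow might look like the sketch below. The question does not show where the labels live, so train_labels and the toy sentences are assumptions made for illustration; in the real code the texts would come from train_data_df['text'] and test_data_df['text'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data; the labels are an assumption (1 = positive, 0 = negative).
train_texts = [
    "great food and friendly staff",
    "loved the great pizza",
    "terrible service and cold food",
    "awful, terrible experience",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great pizza", "terrible food"]

vectorizer = TfidfVectorizer(max_df=0.9, stop_words='english')
# Learn vocabulary and IDF weights from the training text, then vectorize it.
tfidf_matrix = vectorizer.fit_transform(train_texts)
# Reuse the fitted vocabulary for the test text (no refitting).
tfidf_matrix_test = vectorizer.transform(test_texts)

# LinearSVC accepts the sparse matrices directly.
clf = LinearSVC()
clf.fit(tfidf_matrix, train_labels)
predictions = clf.predict(tfidf_matrix_test)
print(predictions)
```

Both matrices have the same number of columns because transform reuses the vocabulary learned by fit_transform, which is what lets the classifier score the test documents.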

PS:

Try to avoid casting the sparse matrix output of the vectorizer to a list or dense array. Doing so is memory intensive, and the classifier will also take more computation time during training and prediction.
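
As a rough illustration of the memory point, the sketch below compares the bytes held by the sparse tf-idf matrix with the bytes a dense array of the same shape would need. The synthetic documents are made up for the example, and the dense size is computed arithmetically rather than by actually allocating the array.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: 200 short documents, each with its own two words.
docs = ["word%d other%d" % (i, i) for i in range(200)]
X = TfidfVectorizer().fit_transform(docs)

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# A dense array stores every cell, zero or not.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(issparse(X), X.shape, sparse_bytes, dense_bytes)
```

On a real review corpus the vocabulary is far larger than here, so the gap between the two numbers grows accordingly.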

Solution 1
0 2019-03-18 06:51:19