Recreating sparse matrix columns for ML model prediction

Question

I have been working on a model using sklearn, and a big part of it uses CountVectorizer() to create a sparse matrix from a set of strings in the training set.

For example:

vectorizer = CountVectorizer(max_features=3000)
sparse_matrix = vectorizer.fit_transform(corpus).toarray()

After exporting the model, what's the best way to format the data I want to predict on so that it matches the feature names created during training? Should I also export (via pickle?) vectorizer.get_feature_names() and then use that? Or is there a better way?

In other words, if in my training set vectorizer.get_feature_names() = ['apple', 'dog', 'cat'] and I would like to make a prediction on 'hello cat', what should my method for feature extraction on the prediction request be? Correct me if I'm wrong, but the result of the feature extraction would need to be [0, 0, 1] to match the model.

I could also be totally off in my approach here, so any help or suggestions are appreciated.

Thanks!

Answer 1

When you write:

vectorizer = CountVectorizer(max_features=3000)
sparse_matrix = vectorizer.fit_transform(corpus).toarray()

the vectorizer is fitted to the VOCABULARY of the words in corpus.

So use the SAME vectorizer to transform another dataset, and you will get the word frequencies of the new dataset laid out according to the vocabulary of corpus.

Remember that you call fit_transform(X) to say "use the vocabulary of X", and you do that just once. After that you call JUST transform(Y), as a way of saying "whatever terms you learned from X, use them as columns, and count the terms of Y in those X columns".

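import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['love dogs, hate cows, and also pigs, actually dogs too']
vectorizer = CountVectorizer(max_features=3000)
# fit_transform learns the vocabulary of corpus and counts its words
sparse_matrix = vectorizer.fit_transform(corpus)
df = pd.DataFrame(sparse_matrix.toarray())
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df.columns = vectorizer.get_feature_names_out()
print(df)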

This would give you:

   actually  also  and  cows  dogs  hate  love  pigs  too
0         1     1    1     1     2     1     1     1    1

And then:

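# Note: transform, not fit_transform, so the fitted vocabulary is reused
test = vectorizer.transform(['hello cat'])
df = pd.DataFrame(test.toarray())
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df.columns = vectorizer.get_feature_names_out()
print(df)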

   actually  also  and  cows  dogs  hate  love  pigs  too
0         0     0    0     0     0     0     0     0    0

Notice how 'hello cat' was mapped onto the vocabulary that fit_transform was called on: the columns are exactly the nine terms learned from corpus, and because neither 'hello' nor 'cat' is in that vocabulary, both are ignored and every count is 0. So in your feature extraction, you transform your 'hello cat' with the vectorizer you called fit_transform on!

And now, USE ALL of these vocabulary columns as FEATURES to predict a label y. What you are doing is called a vector space model.
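As for exporting: rather than pickling only the feature names, one option is to pickle the fitted vectorizer itself next to the model, so that the prediction code can call transform on it. Below is a minimal sketch of that, not a definitive setup. The toy texts, labels and the MultinomialNB classifier are assumptions made for illustration, and pickle.dumps/pickle.loads stand in for writing to and reading from a file.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, for illustration only.
train_texts = ["love dogs", "love cats", "hate cows", "hate pigs"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer(max_features=3000)
X_train = vectorizer.fit_transform(train_texts)  # fit the vocabulary once
model = MultinomialNB().fit(X_train, train_labels)

# Persist the fitted vectorizer together with the model.
# Kept in memory here; pickle.dump to a file works the same way.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})

# At prediction time: restore both and call transform (not fit_transform).
restored = pickle.loads(blob)
X_new = restored["vectorizer"].transform(["hello cats"])
prediction = restored["model"].predict(X_new)

# The new row has exactly the columns the model was trained on.
print(X_new.shape[1] == X_train.shape[1])  # True
```

Pickles are generally only safe to load with the same scikit-learn version that wrote them, so it is worth pinning the version alongside the exported file.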
