Recreating sparse matrix columns for ML model predictions
I have been working on a model using sklearn, and a big part of it uses CountVectorizer() to create a sparse matrix from a set of strings in the training set.
For example:
vectorizer = CountVectorizer(max_features=3000)
sparse_matrix = vectorizer.fit_transform(corpus).toarray()
After exporting the model, what's the best way to format the data I want to predict on so that it matches the feature names created during training? Should I also export (via pickle?) vectorizer.get_feature_names() and then use that? Or is there a better way?
In other words, if in my training set vectorizer.get_feature_names() = ['apple', 'dog', 'cat'] and I would like to make a prediction on 'hello cat', what should my method for feature extraction on the prediction request be? Correct me if I'm wrong, but the result of the feature extraction would need to be [0, 0, 1] to match the model.
I could also be totally off on my approach here, so any help or suggestions are appreciated.
Thanks!
When you type
vectorizer = CountVectorizer(max_features=3000)
sparse_matrix = vectorizer.fit_transform(corpus).toarray()
This vectorizer is fitted to the VOCABULARY of words in corpus. So use the SAME vectorizer to transform another dataset, and you will get the word frequencies of the new dataset in columns CORRESPONDING to the vocabulary of corpus.
Remember, you call fit_transform(X) to say "use the vocabulary of X", and you do that just once. After that you call JUST transform(Y), as a way of saying "whatever terms you found in X, use them as columns, and count the terms of Y in these X columns".
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['love dogs, hate cows, and also pigs, actually dogs too']
vectorizer = CountVectorizer(max_features=3000)
sparse_matrix = vectorizer.fit_transform(corpus)  # learn the vocabulary and count terms
df = pd.DataFrame(sparse_matrix.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df.columns = vectorizer.get_feature_names_out()
print(df)
Would give you this:
   actually  also  and  cows  dogs  hate  love  pigs  too
0         1     1    1     1     2     1     1     1    1
and then:
test = vectorizer.transform(['hello cat'])  # note: transform, not fit_transform
df = pd.DataFrame(test.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df.columns = vectorizer.get_feature_names_out()
print(df)
gives:
   actually  also  and  cows  dogs  hate  love  pigs  too
0         0     0    0     0     0     0     0     0    0
Notice how 'hello cat' was mapped onto the vocabulary that fit_transform was called on: the columns are exactly the same nine as before. Neither 'hello' nor 'cat' occurs in corpus, so every count here is 0. A word that is in the fitted vocabulary would get a nonzero count in its column, and words the vectorizer has never seen are simply dropped. So in your feature extraction, you transform your 'hello cat' with the vocabulary you called fit_transform on!
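Applied to the example from the question, a minimal sketch might look like the following. The one-line training text is a made-up stand-in for a real training set, and the column order is read from the fitted vocabulary_ mapping. Note that a fitted CountVectorizer orders its columns alphabetically, so the vector for 'hello cat' comes out as [0, 1, 0] rather than [0, 0, 1]; what matters is that training and prediction share the same fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training text containing the three words from the question.
train_corpus = ["apple dog cat"]

vectorizer = CountVectorizer()
vectorizer.fit(train_corpus)  # learn the vocabulary once, on training data

# Column order comes from the fitted vocabulary (sorted alphabetically).
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(feature_names)  # ['apple', 'cat', 'dog']

# Encode the prediction request with the same fitted vectorizer.
# 'hello' is not in the vocabulary, so it is ignored.
request_vector = vectorizer.transform(["hello cat"]).toarray()
print(request_vector)  # [[0 1 0]]
```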
And now, use all of these columns as FEATURES to predict a label y. What you are doing is called a Vector Space Model.
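As for exporting: rather than pickling the list of feature names, one option is to pickle the fitted vectorizer itself, or a pipeline that contains it, so the prediction side can call transform with the exact training vocabulary. The sketch below is only an illustration under assumptions: the toy texts and labels are invented, and MultinomialNB is an arbitrary stand-in for whatever model you actually trained.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; the labels are made up for illustration.
train_texts = ["love dogs", "love cats", "hate cows", "hate pigs"]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# so the vocabulary travels with the model.
model = make_pipeline(CountVectorizer(max_features=3000), MultinomialNB())
model.fit(train_texts, train_labels)

# Serialize the whole pipeline (in practice, pickle.dump to a file).
blob = pickle.dumps(model)

# Later, at prediction time: restore it and pass raw strings directly.
restored = pickle.loads(blob)
prediction = restored.predict(["hello cat"])
print(prediction)
```

Bundling both steps means the prediction request never has to be aligned to the feature names by hand. Pickles are generally only safe to load with the same scikit-learn version that wrote them.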