Predicting with a trained model

Question

I used logistic regression to create a model and then saved it using joblib. Later I tried loading that model and predicting labels for my test.csv. Whenever I try this, I get an error saying "X has 1433445 features per sample; expecting 3797015". This is my initial code:

import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression


#reading data 
train=pd.read_csv('train_yesindia.csv')
test=pd.read_csv('test_yesindia.csv')

train=train.iloc[:,1:]
test=test.iloc[:,1:]

test.info()
train.info()

test['label']='t'

test=test.fillna(' ')
train=train.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
train['total']=train['title']+' '+train['author']+train['text']


transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)


targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)

#split in samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, targets, random_state=0)



logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print('Accuracy of Lasso classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Lasso classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))


targets = train['label'].values
logreg = LogisticRegression()
logreg.fit(counts, targets)

example_counts = count_vectorizer.transform(test['total'].values)
predictions = logreg.predict(example_counts)
pred=pd.DataFrame(predictions,columns=['label'])
pred['id']=test['id']
pred.groupby('label').count()

#dumping models
from joblib import dump, load
dump(logreg,'mypredmodel1.joblib')

Later I loaded the model in a separate script:

import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load

test=pd.read_csv('test_yesindia.csv')
test=test.iloc[:,1:]
test['label']='t'
test=test.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']

#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))


test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check

#load_model

logreg = load('mypredmodel1.joblib')


example_counts = count_vectorizer.fit_transform(test['total'].values)
predictions = logreg.predict(example_counts)

When I run it, I get this error:

predictions = logreg.predict(example_counts)
Traceback (most recent call last):

  File "<ipython-input-58-f28afd294d38>", line 1, in <module>
    predictions = logreg.predict(example_counts)

  File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
    scores = self.decision_function(X)

  File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
    % (X.shape[1], n_features))

ValueError: X has 1433445 features per sample; expecting 3797015

Answer 1

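Most probably, this is because you are re-fitting your transformers on the test set. A CountVectorizer fitted on the test text learns a different vocabulary than the one learned from the training text, so it produces a different number of columns (1433445 instead of the 3797015 the model was trained on). This must not be done: you should also save the transformers as fitted on your training set, and use the test (or any other future) set only for transforming data.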
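To see the effect on a small scale, here is a minimal sketch with made-up sentences (the toy documents and variable names are illustrative assumptions, not taken from the question). It compares re-fitting a new vectorizer on the test text with reusing the vectorizer fitted on the training text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration only
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["a bird sang"]

# Fit the vectorizer once, on the training text
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)

# Wrong: a fresh vectorizer fitted on the test text learns its own vocabulary
X_refit = CountVectorizer(ngram_range=(1, 2)).fit_transform(test_docs)

# Right: reuse the fitted vectorizer and only transform the test text
X_test = vectorizer.transform(test_docs)

print("train columns:", X_train.shape[1])
print("re-fitted test columns:", X_refit.shape[1])
print("transformed test columns:", X_test.shape[1])
```

Only the `transform` result has the same number of columns as the training matrix, which is what a model fitted on that matrix expects.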

This is easier to do with pipelines.

So, remove the following code from your first block:

transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)


targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)

and replace it with:

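from sklearn.pipeline import Pipeline
from joblib import dump  # needed here, since dump is called before the later import

# Chain the two transformers so they are fitted and saved together
pipeline = Pipeline([
                ('counts', CountVectorizer(ngram_range=(1, 2))),
                ('tf-idf', TfidfTransformer(smooth_idf=False))
            ])

# Fit on the training text only
pipeline.fit(train['total'].values)

tfidf = pipeline.transform(train['total'].values)
targets = train['label'].values

# Only transform the test text; never re-fit on it
test_tfidf = pipeline.transform(test['total'].values)

# Save the fitted transformers for later use
dump(pipeline, 'transform_predict.joblib')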

Now, in your second code block, remove this part:

#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))

test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check

and replace it with:

pipeline = load('transform_predict.joblib')
test_tfidf = pipeline.transform(test['total'].values)

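And you should be fine, provided that you call predict on the test_tfidf variable, and not on example_counts, which is not transformed by TF-IDF. Note that this also assumes the saved logreg was fitted on tfidf; your original code fits the final model on counts, which no longer exists after the change above.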

predictions = logreg.predict(test_tfidf)
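As a variation on the approach above (not what the answer's code does), the classifier itself can also be placed in the pipeline, so that a single file holds the fitted vectorizer, the TF-IDF transformer, and the model. The sketch below uses made-up toy texts, labels, and a hypothetical file name; none of them come from the question:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, made up for illustration only
train_texts = [
    "stocks rally as markets open higher",
    "aliens secretly run the central bank",
    "parliament passes the annual budget",
    "miracle pill cures every known disease",
]
train_labels = [0, 1, 0, 1]

# Transformers and classifier are fitted together, on training data only
model = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(smooth_idf=False)),
    ("clf", LogisticRegression()),
])
model.fit(train_texts, train_labels)

# Save the whole fitted pipeline to one file (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "text_model.joblib")
dump(model, model_path)

# In a later script: load the pipeline and predict directly on raw text
loaded = load(model_path)
new_texts = ["the budget was passed by parliament"]
predictions = loaded.predict(new_texts)
print(predictions)
```

With this layout the second script never constructs or fits a vectorizer, so the feature count seen by the classifier cannot drift from the one used in training.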

Answer 1: accepted, 2020-03-23 16:09:04
