Predicting with a saved, trained classifier in scikit-learn

Question

I wrote a classifier for tweets in Python and saved it to disk in .pkl format, so that I can run it again and again without having to train it each time. This is the code:

import re

import joblib
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split


# Read the dataset of tweets

header_row = ['sentiment', 'tweetid', 'date', 'query', 'user', 'text']
train = pandas.read_csv("training.data.csv", names=header_row)

# Keep only the columns we need

train = train[["sentiment", "text"]]

# Remove punctuation, special characters and numbers, and lower-case the text

def remove_spch(text):

    return re.sub("[^a-z]", ' ', text.lower())

train['text'] = train['text'].apply(remove_spch)


# Feature hashing

def tokens(doc):
    """Extract tokens from doc.

    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))

n_features = 2**18
# alternate_sign=False keeps the hashed values non-negative, which chi2 requires
hasher = FeatureHasher(n_features=n_features, input_type="string", alternate_sign=False)
X = hasher.transform(tokens(d) for d in train['text'])

y = train['sentiment']

# Keep the 20,000 features with the highest chi-squared score
X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)

# Hold out 20% of the data for testing
a_train, a_test, b_train, b_test = train_test_split(X_new, y, test_size=0.2, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())

# Export the trained model to load it in another project

joblib.dump(classifier, 'my_model.pkl', compress=9)

Suppose I have another Python file in which I want to classify a tweet. How do I go about the classification?

import joblib
model_clone = joblib.load('my_model.pkl')

mytweet = 'Uh wow:@medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'

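Up to hasher.transform I can repeat the same procedure to prepare the tweet for the prediction model, but after that I am stuck: I cannot compute the best 20k features. To use SelectKBest you need to pass both the features and the labels. Since the label is what I want to predict, I cannot use SelectKBest. So how can I get past this issue and carry on with the prediction?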

Answer 1

I second @EdChum's comment:

you built a model by training it on data which presumably is representative enough for it to cope with unseen data

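In practice, this means that you need to apply both FeatureHasher and SelectKBest to the new data using transform only, without fitting them again. (Re-fitting on the new data would be wrong, because in general it would produce different features: a re-fitted SelectKBest would pick different columns, and it needs labels you do not have. FeatureHasher is stateless, so it only has to be created with the same settings.)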
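The point above can be sketched on toy data. This is only an illustration under stated assumptions: the documents, the labels, the feature-space size 2**10 and k=5 are all made up for the example and are not the values from the question.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy training data: each document is a list of tokens.
train_docs = [["good", "movie"], ["bad", "movie"], ["good", "day"], ["bad", "day"]]
train_labels = [1, 0, 1, 0]

# FeatureHasher is stateless, so transform() works without fitting.
# alternate_sign=False keeps the values non-negative, as chi2 requires.
hasher = FeatureHasher(n_features=2**10, input_type="string", alternate_sign=False)
X_train = hasher.transform(train_docs)

# Fit the selector once, on the training data and its labels.
selector = SelectKBest(chi2, k=5).fit(X_train, train_labels)

# At prediction time only transform() is called, so no labels are needed.
X_unseen = selector.transform(hasher.transform([["good", "boat"]]))
print(X_unseen.shape)
```

The fitted selector remembers which columns it chose during training, so the unseen document ends up with the same columns the classifier was trained on.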

To do this, either

pickle FeatureHasher and SelectKBest separately

or (better)

make a Pipeline of FeatureHasher, SelectKBest and RandomForestClassifier and pickle the whole pipeline. Then you can load this pipeline and use predict on new data.
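A minimal sketch of the Pipeline option follows. It is not the asker's actual setup: the four training sentences, their labels (0 and 4), the small n_features and k, the temporary file location and the file name my_pipeline.pkl are assumptions made so the example runs on its own. Tokenizing happens outside the pipeline here, so the same tokens helper has to be available in the script that loads the model.

```python
import os
import re
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline


def tokens(doc):
    """Lower-case a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


# Hypothetical toy data standing in for the tweet dataset.
train_texts = ["what a great day", "great boat trip",
               "awful rainy day", "awful boat delay"]
train_labels = [4, 4, 0, 0]

pipeline = Pipeline([
    ("hasher", FeatureHasher(n_features=2**10, input_type="string",
                             alternate_sign=False)),
    ("select", SelectKBest(chi2, k=5)),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
])
# Lists (not generators) are passed so the data can be read more than once.
pipeline.fit([tokens(t) for t in train_texts], train_labels)

# Training script: persist the whole fitted pipeline in one file.
model_path = os.path.join(tempfile.mkdtemp(), "my_pipeline.pkl")
joblib.dump(pipeline, model_path, compress=9)

# Prediction script: load the pipeline and call predict on raw tokens.
loaded = joblib.load(model_path)
mytweet = ("Uh wow:@medium is doing a crowdsourced data-driven investigation "
           "tracking down a disappeared refugee boat")
prediction = loaded.predict([tokens(mytweet)])
print(prediction)
```

Because hashing, feature selection and the classifier travel together in one object, the prediction script never calls SelectKBest with labels; it only replays the selection learned during training.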

Solution 1
5 2015-10-07 14:42:55
