How to save a scikit-learn classifier that uses a vectorizer, a pipeline and GridSearchCV?
I built a sentiment classifier using the following steps:
load dataset with pandas
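from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Bag-of-words counts for the raw texts in x
count = CountVectorizer()
bag = count.fit_transform(x)
bag.toarray()
# Re-weight the counts with tf-idf
tfidf = TfidfTransformer(use_idf=True, norm="l2", smooth_idf=True)
tfidf.fit_transform(bag).toarray()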
from collections import Counter
vocab = Counter()
for text in x:
    for word in text.split(" "):
        vocab[word] += 1
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
vocab_reduced = Counter()
for w, c in vocab.items():
    if w not in stop:
        vocab_reduced[w] = c
import re

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove any non-word character and append the emoticons,
    # removing the nose character for standardization. Convert to lower case
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    return text
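As a quick check of what preprocessor does, here is a minimal sketch that runs it on one made-up review string (the sample text is an assumption for illustration, not from my dataset). The function is repeated so the snippet runs on its own.

```python
import re

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lower-case, replace runs of non-word characters with a space,
    # then append the emoticons without the "nose" character
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    return text

# Hypothetical sample input
sample = "<p>This movie is GREAT :-)</p>"
cleaned = preprocessor(sample)
print(cleaned.split())  # ['this', 'movie', 'is', 'great', ':)']
```

The cleaned string can contain doubled spaces, which is harmless here because the tokenizers split on any whitespace.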
from nltk.stem import PorterStemmer
porter = PorterStemmer()
def tokenizer(text):
    return text.split()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__preprocessor': [None, preprocessor],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__alpha': [0, 1],
               'clf__fit_prior': [False, True]},
              ]
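from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

multi_tfidf = Pipeline([("vect", tfidf),
                        ("clf", MultinomialNB())])
gs_multi_tfidf = GridSearchCV(multi_tfidf, param_grid,
                              scoring="accuracy",
                              cv=5,
                              verbose=1,
                              n_jobs=-1)
gs_multi_tfidf.fit(X_train, y_train)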
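I tried to use joblib to save the pipeline, and also to save both the classifier and the pipeline, so that I could use the model in a website. Every attempt failed. I get either ValueError: not enough values to unpack (expected 2, got 1) (when both the pipeline and the classifier were saved) or TypeError: 'module' object is not callable.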
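For context, one common way to hit TypeError: 'module' object is not callable is to call a module itself instead of a function inside it, for example joblib(model, path) rather than joblib.dump(model, path). Whether that is what happened here cannot be told from the post, so treat the following only as a sketch of the mechanism. The standard-library pickle module stands in for joblib so the snippet runs without extra packages, and the model dictionary is a hypothetical placeholder.

```python
import pickle  # stand-in for joblib in this sketch

# Hypothetical placeholder for a fitted model
model = {"alpha": 1.0, "fit_prior": True}

try:
    # Wrong: this calls the module object itself
    pickle(model)
except TypeError as exc:
    message = str(exc)
    print(message)

# Right: call a function that lives inside the module
data = pickle.dumps(model)
restored = pickle.loads(data)
```

The same distinction applies to joblib: the callable entry points are joblib.dump and joblib.load, not joblib itself.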
Please try the following. Is there a specific reason why CountVectorizer() and TfidfTransformer() are not included? You should also specify exactly how you are trying to save the model.
multi_tfidf = Pipeline([("vect", TfidfVectorizer()),
                        ("clf", MultinomialNB())])
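Since the question does not show the saving code, here is a minimal sketch of one way to persist the result of a grid search with joblib, under some assumptions: the toy texts, labels, reduced parameter grid, and file name are all made up for illustration. It saves best_estimator_, which is the whole fitted pipeline (vectorizer plus classifier), so nothing else has to be saved separately.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch only
X_train = ["good movie", "great film", "loved it", "really good",
           "bad movie", "awful film", "hated it", "really bad"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([("vect", TfidfVectorizer()),
                     ("clf", MultinomialNB())])
# Deliberately small grid; the real grid from the question could be used instead
param_grid = {"clf__alpha": [0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(X_train, y_train)

# best_estimator_ is the full pipeline, refit on all training data
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
joblib.dump(search.best_estimator_, model_path)

# Later, for example inside the web application
loaded = joblib.load(model_path)
predictions = loaded.predict(["good film", "awful movie"])
print(predictions)
```

One caveat if custom functions such as tokenizer, tokenizer_porter or preprocessor end up in the saved pipeline: they are pickled by reference, so the code that loads the file must be able to import them from the same module path that was used when the model was saved. Putting them in a small module shared by the training script and the website is one way to arrange that.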