How to save a scikit-learn classifier that uses a vectorizer, a pipeline and GridSearchCV?

I built a sentiment classifier using the following steps:

load dataset with pandas

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count = CountVectorizer()
bag = count.fit_transform(x)
bag.toarray()
tfidf = TfidfTransformer(use_idf=True, norm="l2",smooth_idf=True)
tfidf.fit_transform(bag).toarray()

from collections import Counter

vocab = Counter()
for text in x:
    for word in text.split(" "):
        vocab[word] += 1

import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')

vocab_reduced = Counter()
for w, c in vocab.items():
    if w not in stop:
        vocab_reduced[w]=c

import re

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove any non-word character and append the emoticons,
    # removing the nose character for standardization. Convert to lower case
    text = (re.sub(r'\W+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    return text

from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

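from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(strip_accents=None,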
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__preprocessor': [None, preprocessor],
               'vect__use_idf':[False],
               'vect__norm':[None],
               "clf__alpha":[0,1],
               "clf__fit_prior":[False,True]},
                ]
multi_tfidf = Pipeline([("vect", tfidf),
                       ( "clf", MultinomialNB())])
gs_multi_tfidf = GridSearchCV(multi_tfidf, param_grid,
                              scoring="accuracy",
                              cv=5,
                              verbose=1,
                              n_jobs=-1)
gs_multi_tfidf.fit(X_train,y_train)

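I tried using joblib to save the pipeline, and also to save both the classifier and the pipeline, so that I could use the model in a website. Every attempt failed. I get either ValueError: not enough values to unpack (expected 2, got 1) (when both the pipeline and the classifier were saved) or TypeError: 'module' object is not callable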
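The saving code is not shown in the question, so the actual cause of these errors is unknown. One common way to get TypeError: 'module' object is not callable in Python is to call a module as if it were a function (for example joblib(...) instead of joblib.dump(...)). The sketch below reproduces that message using the standard-library pickle module as a stand-in for joblib; the `model` dictionary is a hypothetical placeholder, not the fitted pipeline from the question.

```python
import pickle

# Hypothetical stand-in for a fitted model (assumption for this sketch).
model = {"name": "placeholder for a fitted pipeline"}

# Calling the module itself instead of one of its functions raises TypeError.
try:
    pickle(model)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Calling a function that lives inside the module works as expected.
restored = pickle.loads(pickle.dumps(model))
```

If the traceback points at a line such as joblib(...) or pickle(...), switching to the module's dump/load functions would be the first thing to check.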

Try the following. Is there a specific reason for not including CountVectorizer() and TfidfTransformer()? You should also specify exactly how you tried to save the model.

multi_tfidf = Pipeline([("vect", TfidfVectorizer()),
                       ( "clf", MultinomialNB())])
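As a minimal sketch of how saving could work with this pipeline: fit the grid search, then persist `best_estimator_`, which is the whole fitted Pipeline (vectorizer plus classifier), with joblib.dump and restore it with joblib.load. The toy data, the reduced parameter grid, the file name and the temporary directory are assumptions made for illustration and do not come from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch only.
X_train = ["good movie", "great film", "loved it", "really good",
           "bad movie", "awful film", "hated it", "really bad"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

multi_tfidf = Pipeline([("vect", TfidfVectorizer()),
                        ("clf", MultinomialNB())])

# Reduced, illustrative grid (not the grid from the question).
param_grid = {"clf__alpha": [0.5, 1.0],
              "clf__fit_prior": [False, True]}

gs = GridSearchCV(multi_tfidf, param_grid, scoring="accuracy", cv=2)
gs.fit(X_train, y_train)

# best_estimator_ is the complete Pipeline, refit on all training data
# because refit=True is the GridSearchCV default.
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
joblib.dump(gs.best_estimator_, model_path)

# Later, for example inside the web application: load and predict on raw text.
loaded = joblib.load(model_path)
new_texts = ["good film", "bad film"]
predictions = loaded.predict(new_texts)
print(list(predictions))
```

Because the saved object is a single Pipeline, raw strings can be passed straight to predict and no separate vectorizer file is needed. One caveat if custom functions such as tokenizer_porter or preprocessor end up in the best pipeline: pickle-based formats store functions by reference, so the module that defines them must be importable wherever the model is loaded.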

