
Sklearn classifier and Flask issues

I have been trying to self-host, with Apache, an sklearn classifier that I put together, and I ended up using joblib to serialize the saved model and then load it in a Flask app. This app worked perfectly when running Flask's built-in development server, but when I set it up on a Debian 9 Apache server, I get a 500 error. Delving into Apache's error.log, I get:

AttributeError: module '__main__' has no attribute 'tokenize'

Now, this is funny to me because, while I did write my own tokenizer, the web app gave me no problems when I was running it locally. Furthermore, the saved model that I used was trained on the web server, so slightly different library versions should not be a problem.
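
From what I can tell, this looks related to how pickle (which joblib uses under the hood) serializes functions: only a reference to the defining module and attribute name is stored, not the function body. The snippet below is a minimal sketch of my own (not part of the app code) showing that a function defined at the top level of a script is recorded as __main__.tokenize; unpickling in a different process then looks up tokenize on whatever __main__ exists there, which is exactly the lookup that fails under mod_wsgi.

import pickle
import re

# Stand-in for a custom tokenizer defined at module level in a script.
def tokenize(text):
    return re.sub(r'\W+', ' ', text).split()

payload = pickle.dumps(tokenize)

# When this file is run directly, the function "lives" in __main__,
# and the pickle stream embeds only that reference, not the code.
print(tokenize.__module__)      # '__main__'
print(b'tokenize' in payload)   # True: only the name is stored

# Loading the pickle elsewhere re-imports tokenize.__module__ and looks up
# the attribute 'tokenize' on it; if that module lacks the attribute (as
# with mod_wsgi's __main__), the result is:
#   AttributeError: module '__main__' has no attribute 'tokenize'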

My code for the web app is:

import re
import sys

from flask import Flask, request, render_template
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.externals import joblib

app = Flask(__name__)



def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/analyze',methods=['POST','GET'])
def analyze():
    if request.method=='POST':
        result=request.form
        input_text = result['input_text']

        clf = joblib.load("model.pkl.z")
        parameters = clf.named_steps['clf'].get_params()
        predicted = clf.predict([input_text])
        # print(predicted)
        certainty = clf.decision_function([input_text])

        # Is it bonkers?
        if predicted[0]:
            verdict = "Not too nuts!"
        else:
            verdict = "Bonkers!"

        return render_template('result.html',prediction=[input_text, verdict, float(certainty), parameters])

if __name__ == '__main__':
    #app.debug = True
    app.run()

With the .wsgi file being:

import sys 
sys.path.append('/var/www/mysite')

from conspiracydetector import app as application
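
As a side note, a small diagnostic (not part of my current setup, just a sketch) could confirm what __main__ actually is under mod_wsgi versus the development server: anything printed to stderr from the .wsgi file ends up in Apache's error.log.

import sys

# Hypothetical diagnostic lines added at the top of the .wsgi file.
print("__main__ is:", sys.modules['__main__'], file=sys.stderr)
print("has tokenize?", hasattr(sys.modules['__main__'], 'tokenize'), file=sys.stderr)

sys.path.append('/var/www/mysite')
from conspiracydetector import app as application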

Furthermore, I trained the model with this code:

import logging
import pprint  # Pretty stuff
import re
import sys  # For command line arguments
from time import time  # to show progress

import numpy as np
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.externals import joblib  # In order to save
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tokenizer that does stemming and strips punctuation
def tokenize(text):
    # text = text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    lemas = []
    for item in tokens:
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas

if __name__ == "__main__":
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected
    # block to be able to use a multi-core grid search that also works under
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is used as the backend of joblib.Parallel
    # that is used when n_jobs != 1 in GridSearchCV

    # Display progress logs on stdout
    print("Initializing...")
    # Command line arguments
    save = sys.argv[1]
    training_directory = sys.argv[2]

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')

    dataset = load_files(training_directory, shuffle=False)
    print("n_samples: %d" % len(dataset.data))

    # split the dataset in training and test set:
    print("Splitting the dataset in training and test set...")
    docs_train, docs_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=None)

    # Build a vectorizer / classifier pipeline that filters out tokens
    # that are too rare or too frequent
    # Also remove stop words
    print("Loading list of stop words...")
    with open('stopwords.txt', 'r') as f:
        words = [line.strip() for line in f]

    print("Stop words list loaded...")
    print("Setting up pipeline...")
    pipeline = Pipeline(
        [
            # ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1,1))),
            ('vect',
             TfidfVectorizer(tokenizer=tokenize, stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1, 1))),
            ('clf', LinearSVC(C=5000)),
        ])

    print("Pipeline:", [name for name, _ in pipeline.steps])

    # Build a grid search to find out whether unigrams or bigrams are
    # more useful.
    # Fit the pipeline on the training set using grid search for the parameters
    print("Initializing grid search...")

    # uncommenting more parameters will give better exploring power but will
    # increase processing time in a combinatorial way
    parameters = {
        # 'vect__ngram_range': [(1, 1), (1, 2)],
        # 'vect__min_df': (0.0005, 0.001),
        # 'vect__max_df': (0.25, 0.5),
        # 'clf__C': (10, 15, 20),
    }
    print("Parameters:")
    pprint.pprint(parameters)
    grid_search = GridSearchCV(
        pipeline,
        parameters,
        n_jobs=-1,
        verbose=True)

    print("Training and performing grid search...\n")
    t0 = time()
    grid_search.fit(docs_train, y_train)
    print("\nDone in %0.3fs!\n" % (time() - t0))

    # Print the mean and std for each candidate along with the parameter
    # settings for all the candidates explored by grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
              % (grid_search.cv_results_['params'][i],
                 grid_search.cv_results_['mean_test_score'][i],
                 grid_search.cv_results_['std_test_score'][i]))

    # Predict the outcome on the testing set and store it in a variable
    # named y_predicted
    print("\nRunning against testing set...\n")
    y_predicted = grid_search.predict(docs_test)

    # Save model
    print("\nSaving model to", save, "...")
    joblib.dump(grid_search.best_estimator_, save)
    print("Model Saved! \nPrepare for some awesome stats!")

I must confess that I am pretty stumped, and after tinkering around, searching, and making sure that my server is configured correctly, I felt that perhaps someone here might be able to help. Any help is appreciated, and if there is any more information that I need to provide, please let me know and I will be happy to.

Also, I am running:

  • python 3.5.3 with nltk and sklearn.
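
If mismatched library versions ever did become a suspect, a quick (hypothetical) check would be to log the versions from inside the app itself, since under mod_wsgi anything printed to stderr lands in Apache's error.log:

import sys

import nltk
import sklearn

# Hypothetical version check; not part of the app above.
print("python :", sys.version, file=sys.stderr)
print("nltk   :", nltk.__version__, file=sys.stderr)
print("sklearn:", sklearn.__version__, file=sys.stderr)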

I solved this problem, although imperfectly, by removing my custom tokenizer and falling back on one of sklearn's.
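
In practice that just means dropping the tokenizer= argument so that TfidfVectorizer falls back on its built-in regex tokenization (essentially the commented-out line in my training script), and then the pickled pipeline no longer references any function defined in __main__. Roughly:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same stop-word list as in the training script above.
with open('stopwords.txt', 'r') as f:
    words = [line.strip() for line in f]

# No tokenizer= argument: the vectorizer uses its default token_pattern,
# so the pickled pipeline contains no reference to a function in __main__.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5,
                             ngram_range=(1, 1))),
    ('clf', LinearSVC(C=5000)),
])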

However, I am still in the dark on how to integrate my own tokenizer.
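
The usual suggestion I have seen (sketched below with a hypothetical module name, and not something I have verified on this setup) is to move tokenize into its own importable module and import it from there in both the training script and the Flask app, so that the pickle records something like text_utils.tokenize instead of __main__.tokenize; the model would then need to be re-trained or re-dumped with that import in place.

# text_utils.py -- hypothetical shared module, e.g. next to
# conspiracydetector.py in /var/www/mysite
import re

from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer


def tokenize(text):
    """Strip punctuation, tokenize, and lemmatize."""
    text = re.sub(r'\W+', ' ', text)
    return [WordNetLemmatizer().lemmatize(token) for token in word_tokenize(text)]

Both the training script and the web app would then do from text_utils import tokenize and pass tokenizer=tokenize to TfidfVectorizer, so that unpickling resolves the function through an importable module rather than through whichever __main__ happens to exist at load time.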
