
How can I make a single prediction on a model using sklearn in Python?

I have trained a machine-learning model on a dataset of companies using sklearn. The dataset has the following attributes: name, domain, year_founded, industry, size_range, locality, country, linkedin_url, current_employee_estimate, total_employee_estimate.

I want to train a machine-learning model to try to predict a size_range value - which falls into one of eight categories, based on the company's size - using the name and year_founded attributes. I have done this using the following training code:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import logistic
from tools import pickleFile
from tools import unpickleFile
from tools import cleanDataset
from tools import getPrettyTimestamp
import sklearn
import pandas as pd
import numpy as np
import datetime
import sys


def train_model(clf, X_train, y_train, epochs=10):
    """
    Trains a specific model and returns a list of results

    :param clf: sklearn model
    :param X_train: encoded training data (attributes)
    :param y_train: training data (attribute to predict)
    :param epochs: number of iterations (default=10)
    :return: result (accuracy) for this training data
    """
    scores = []
    print("Starting training...")
    for i in range(1, epochs + 1):
        print("Epoch:" + str(i) + "/" + str(epochs) + " -- " + str(datetime.datetime.now()))
        clf.fit(X_train, y_train)
        score = clf.score(X_train, y_train)
        scores.append(score)
    print("Done training.  The score(s) is/are: " + str(scores))
    return scores

def main():

    # Parse the arguments.
    userRequestedTrain, filename = parseArgs()

    # Some custom Pandas settings - TODO remove this
    pd.set_option('display.max_columns', 30)
    pd.set_option('display.max_rows', 1000)

    dataset = pd.read_csv("companies_sorted.csv", nrows=50000)


    origLen = len(dataset)
    print(origLen)

    dataset = cleanDataset(dataset)

    cleanLen = len(dataset)
    print(cleanLen)

    print("\n======= Some Dataset Info =======\n")
    print("Dataset size (original):\t" + str(origLen))
    print("Dataset size (cleaned):\t" + str(len(dataset)))
    print("\nValues of size_range:\n")
    print(dataset['size_range'].value_counts())
    print()

    # size_range is the attribute to be predicted, so we pop it from the dataset
    sizeRange = dataset.pop("size_range").values

    # We split our dataset and attribute-to-be-predicted into training and testing subsets.
    xTrain, xTest, yTrain, yTest = train_test_split(dataset, sizeRange, test_size=0.25, random_state=1)


    print(xTrain.transpose())
    le = LabelEncoder()
    ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

    # Our feature set, i.e. the inputs to our machine-learning model.
    featureSet = ['name', 'year_founded']

    # Making a copy of test and train sets with only the columns we want.
    xTrain_sf = xTrain[featureSet].copy()
    xTest_sf = xTest[featureSet].copy()

    # Apply one-hot encoding to columns
    ohe.fit(xTrain_sf)

    print(xTrain_sf)
    print(xTest_sf)

    featureNames = ohe.get_feature_names()

    # Encoding test and train sets
    xTrain_sf_encoded = ohe.transform(xTrain_sf)
    xTest_sf_encoded = ohe.transform(xTest_sf)

    # ------ Using Logistic Regression classifier - TRAINING PHASE ------

    if userRequestedTrain:
        # We define the model we're going to use.
        lrModel = LogisticRegression(solver='lbfgs', multi_class="multinomial", max_iter=1000, random_state=1)

        # Now, let's train it.
        lrScores = train_model(lrModel, xTrain_sf_encoded, yTrain, 1)

        # Save the model as a file.
        filename = "models/Model_" + getPrettyTimestamp()
        print("Training done! Pickling model to " + str(filename) + "...")
        pickleFile(lrModel, filename)

    # Reload the model for testing.  If we didn't train the model ourselves, then it was specified as an argument.
    lrModel = unpickleFile(filename)

    PRED = lrModel.predict(xTrain_sf_encoded[0:10])

    print("Unpickled successfully from file " + str(filename))

    # ------- TESTING PHASE -------

    testLrScores = train_model(lrModel, xTest_sf_encoded, yTest, 1)

    if userRequestedTrain:
        trainScore = lrScores[0]
    else:
        trainScore = 0.9201578143173162  # Modal training score - substitute if we didn't train model ourselves

    testScore = testLrScores[0]

    scores = sorted([(trainScore, 'train'), (testScore, 'test')], key=lambda x: x[0], reverse=True)
    better_score = scores[0]  # largest score
    print(scores)

    # Which score was better?
    print("Better score: %s" % "{}".format(better_score))

    print("Pickling....")

    pickleFile(lrModel, "models/TESTING_" + getPrettyTimestamp())

This code runs successfully - the training and testing phases complete, with the testing phase reaching about 60% accuracy:

Starting training...
Epoch:1/1 -- 2019-12-18 20:03:13.462479
Done training.  The score(s) is/are: [0.8854667949951877]
Training done! Pickling model to models/Model_2019-12-18_2003...
Unpickled successfully from file models/Model_2019-12-18_2003
= = = = = = = = = = = = = = = = = = = 

First 10 predictions:

['5001 - 10000' '10001+' '1001 - 5000' '5001 - 10000' '1001 - 5000'
 '1001 - 5000' '5001 - 10000' '1001 - 5000' '1001 - 5000' '1001 - 5000']
['5001 - 10000' '10001+' '1001 - 5000' '5001 - 10000' '1001 - 5000'
 '1001 - 5000' '5001 - 10000' '1001 - 5000' '1001 - 5000' '1001 - 5000']
 = = = = = = = = = = = = = 
Starting training...
Epoch:1/1 -- 2019-12-18 20:03:20.775392
Done training.  The score(s) is/are: [0.5906466512702079]
[(0.8854667949951877, 'train'), (0.5906466512702079, 'test')]
Better score: (0.8854667949951877, 'train')
Pickling....

Process finished with exit code 0

However, let's say that I want to make a SINGLE prediction using this model, i.e. by passing it a company name and its year of founding. I do the following:

lrModel = pickle.load(open(filename, 'rb'))
predictedSet = lrModel.predict([["SomeRandomCompany", 2019]])

But when I do so, I get the following ValueError:

  X = check_array(X, accept_sparse='csr')
Traceback (most recent call last):
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 85, in <module>
    main()
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 58, in main
    predictions(model, reducedSetEncoded, reducedSet)
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 80, in predictions
    predictedSet = lrModel.predict([["SomeCompany", 2019]])
  File "/home/ivor/Documents/companySizeEstimator/venv/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 293, in predict
    scores = self.decision_function(X)
  File "/home/ivor/Documents/companySizeEstimator/venv/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 272, in decision_function
    raise ValueError("X has %d features per sample; expecting %d"
ValueError: X has 2 features per sample; expecting 54897

It seems to want a dataset with the exact same shape as the one used to train it, i.e. one with 11,000 rows. It gives predictions without any problem in the testing phase, so the model is clearly capable of making predictions. How can I get it to make a prediction based on just one value, as shown above?

When you train the model on a dataset with N features, the model expects the same number of features at prediction time as well: it learned from those N features, so any input you ask it to predict on must have the same dimensions. That is why you get the "X has 2 features per sample; expecting 54897" error.

One thing you can do is create a matrix or DataFrame of zeros that matches the required dimensions (N) and fill in the values you want to predict on at their exact positions.
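
In practice, the OneHotEncoder that was fitted on xTrain_sf already does that mapping for you: transforming a one-row DataFrame with the same two columns produces a vector of the width the model expects, with zeros everywhere except the positions matching your sample. A minimal sketch, assuming the fitted ohe and lrModel objects from your training script are still available (e.g. pickled alongside the model):

import pandas as pd

# One-row DataFrame with the same columns the encoder was fitted on
sample = pd.DataFrame([["SomeRandomCompany", 2019]], columns=['name', 'year_founded'])

# handle_unknown='ignore' means an unseen company name simply encodes to all zeros
sample_encoded = ohe.transform(sample)  # shape (1, 54897) -- the width the model expects

print(lrModel.predict(sample_encoded))

Note that this also means the encoder has to be saved and reloaded together with the model; otherwise you cannot rebuild the 54,897-wide input at prediction time.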

I think you should double-check the DataFrame used for training, xTrain_sf_encoded: it should be a 2-column DataFrame, yet for some reason it has 54,897 columns.
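
For instance, a quick shape check (assuming the variables from your training script) shows where that number comes from:

print(xTrain_sf.shape)          # (n_samples, 2)  -- the two raw feature columns
print(xTrain_sf_encoded.shape)  # (n_samples, 54897) -- width after one-hot encoding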

One more thing: why are you doing this in the testing phase?

testLrScores = train_model(lrModel, xTest_sf_encoded, yTest, 1)

You are re-training the model here, whereas I believe you want to test it, like so:

# Print Predictions
yPred = lrModel.predict(xTest_sf_encoded)
print(yPred)
# Print the actual values
print(yTest)
# Compare
print(yPred==yTest)
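
If you want a single accuracy figure instead of an element-wise comparison, a short follow-up (same variable names as above) could be:

from sklearn.metrics import accuracy_score

# Overall accuracy of the predictions above
print(accuracy_score(yTest, yPred))

# Equivalently, score() computes accuracy on the test set directly
print(lrModel.score(xTest_sf_encoded, yTest))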
