
Test a machine learning model with a categorical variable in Python

I have a data set like this:

[image not available]

As you can see, there is one categorical variable, the state.

Later I encode the categorical variable:

[image not available]

If I want to test my model with specific data, I do something like this:

print(regressor.predict([[1,0,1000,2000,3000]]))

This works fine. But while testing, I want to enter the state name directly, such as New York or Florida.

How can I achieve this?

A machine learning model works only on numeric data, which is why you had to encode the state column. There are a few ways to do what you describe: a) Use a function that returns the encoded value of the state, so that you can write something like

print(regressor.predict([[1,0,1000,func("New York"),3000]]))

b) Use implicit encoding (one-hot or dummy encoding), which automatically creates a separate column for each value of a categorical variable.
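Option a) could be sketched roughly as follows. This is only an illustration: the column order in STATE_COLUMNS and the helper build_row are assumptions made for the example, and the order must match whatever encoding was used when the model was trained.

```python
# Hypothetical order of the encoded state columns.
# It must match the encoding used on the training data.
STATE_COLUMNS = ["New York", "Florida"]


def func(state):
    """Return the one-hot columns for a state name as a list of 0/1 values."""
    if state not in STATE_COLUMNS:
        raise ValueError(f"Unknown state: {state!r}")
    return [1 if state == column else 0 for column in STATE_COLUMNS]


def build_row(state, *numeric_features):
    """Combine the encoded state with the remaining numeric features."""
    return func(state) + list(numeric_features)


row = build_row("New York", 1000, 2000, 3000)
print(row)  # [1, 0, 1000, 2000, 3000]
```

With the fitted regressor from the question, the row would then be passed as regressor.predict([row]).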

Since an ML model accepts only numeric input, the test data must also be encoded before it is passed to the model.

You could use scikit-learn's LabelEncoder to transform the categorical value and to inverse-transform it back.

For example:

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["New York", "Florida", "US", "Florida", "New York"])
LabelEncoder()
>>> le.transform(["New York", "Florida", "US", "Florida", "New York"])
array([1, 0, 2, 0, 1])
>>> le.inverse_transform([0])
array(['Florida'], dtype='<U8')

You can then call predict like this:

print(regressor.predict([[1,0,1000,le.transform(["New York"])[0],3000]]))

As others have mentioned, a model takes only numbers as input. For this reason, we usually write a preprocessing function that can be applied to both the training set and the test set.

In this case, you need to define a function that transforms the input vector into a numeric vector that can then be fed to your machine learning model:

Inputs -> Preprocessing -> Model

This preprocessing must be exactly the same as the one used for training; otherwise the model will not give the results you expect.
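One way to keep the preprocessing identical for training and prediction is to let scikit-learn carry it along with the model. The following is a minimal sketch, assuming pandas and scikit-learn are installed; the column names State and Spend and all numbers are made up for illustration and do not come from the question's data set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with one categorical and one numeric column.
train = pd.DataFrame({
    "State": ["New York", "Florida", "New York", "Florida"],
    "Spend": [1000.0, 2000.0, 3000.0, 4000.0],
})
profit = [10.0, 25.0, 30.0, 45.0]

# One-hot encode the State column and pass the other columns through unchanged.
preprocess = ColumnTransformer(
    [("state", OneHotEncoder(handle_unknown="ignore"), ["State"])],
    remainder="passthrough",
)

# Inputs -> Preprocessing -> Model, fitted and applied as a single object.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(train, profit)

# At prediction time the state name is passed as plain text.
new_row = pd.DataFrame({"State": ["New York"], "Spend": [2000.0]})
prediction = model.predict(new_row)
print(prediction)
```

Because the encoder is fitted inside the pipeline, the same encoding is reused automatically whenever model.predict is called.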

So typically, when you create a model, the complete 'Model' can be a wrapper around the actual model you use. For instance:

class MyModel:

    def __init__(self, model):
        # Inputs and other variables such as hyperparameters
        self.model = model  # a model of your choice with fit() and predict()

    def preprocess(self, rows):
        # Turn raw rows into numeric rows, exactly as was done for training
        # (for example, replace the state name with its encoded columns)
        raise NotImplementedError

    def train(self, X_train, y_train):
        self.model.fit(self.preprocess(X_train), y_train)

    def predict(self, X_test):
        # If X_test is a single row, wrap it in a list before calling predict
        pred = self.model.predict(self.preprocess(X_test))

        # Evaluate pred against the true values here if they are available
        return pred

So, finally, to predict you call MyModel.predict() rather than the underlying model's predict().

This is not elegant at all, but you can simply write an if ... elif statement that branches on the input, like:

a = input("Please enter the state: ")
if a == "New York":
    print(regressor.predict([[1,0,1000,2000,3000]]))
elif a == "Florida":
    print(regressor.predict([[0,1,1000,2000,3000]]))
else:
    print("Invalid state selected")
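If more states are added, the branches could be replaced by a lookup table. This is only a sketch: the STATE_ENCODING table and the predict_for_state helper are hypothetical names, and the encodings are copied from the two branches above.

```python
# Hypothetical table of encodings, taken from the if/elif branches above.
STATE_ENCODING = {
    "New York": [1, 0],
    "Florida": [0, 1],
}


def predict_for_state(predict, state, numeric_features):
    """Build the full feature row for a state and pass it to predict.

    predict is any callable that accepts a list of rows, such as the
    predict method of a fitted regressor. Returns None for an unknown state.
    """
    encoded = STATE_ENCODING.get(state)
    if encoded is None:
        return None
    return predict([encoded + list(numeric_features)])
```

With the regressor from the question, this could be called as predict_for_state(regressor.predict, a, [1000, 2000, 3000]), printing "Invalid state selected" when the result is None.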
