
How do I create dummy variables for prediction from user input (only one record)?

I am trying to create a web application for predicting airline delays. I have trained my model offline on my computer, and am now trying to make a Flask app that makes predictions based on user input. For simplicity, let's say my model has 3 categorical variables: UNIQUE_CARRIER, ORIGIN and DESTINATION (the DEST column in the code). While training, I create dummy variables for all 3 using pandas:

df = pd.concat([df, pd.get_dummies(df['UNIQUE_CARRIER'], drop_first=True, prefix="UNIQUE_CARRIER")], axis=1)
df = pd.concat([df, pd.get_dummies(df['ORIGIN'], drop_first=True, prefix="ORIGIN")], axis=1)
df = pd.concat([df, pd.get_dummies(df['DEST'], drop_first=True, prefix="DEST")], axis=1)
df.drop(['UNIQUE_CARRIER', 'ORIGIN', 'DEST'], axis=1, inplace=True)

So now my feature vector is 297 long (assuming there are 100 unique carriers and 100 different airports in my data; drop_first=True removes one level per variable, giving 99 + 99 + 99). I saved my model using pickle, and am now trying to predict based on user input. The user input comes as 3 variables (origin, destination, carrier).

Obviously I cannot call pd.get_dummies on each user input, because each of the three fields would contain only one unique value. What is the most efficient way to convert the user input into the feature vector for my model?

Since you are using pandas dummies, and hence dense vectors, a good way to create a new vector is to build a dict mapping each term to its vector index and then populate a zeros vector according to it, something along these lines:

index_dict = dict(zip(df.columns,range(df.shape[1])))
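To make the shape of that mapping concrete, here is a minimal sketch on a toy frame. The sample carriers and airports are made up for illustration, and the sketch assumes df holds only the dummy columns at the point where the mapping is built. Note that the keys carry the prefix passed to get_dummies, and that the level removed by drop_first=True has no entry.

```python
import pandas as pd

# Toy training data; the values are illustrative only.
df = pd.DataFrame({
    "UNIQUE_CARRIER": ["AA", "DL", "UA"],
    "ORIGIN": ["JFK", "LAX", "SFO"],
    "DEST": ["LAX", "SFO", "JFK"],
})

# Produces the same dummy columns as the concat/drop code in the question,
# written as a single call (the prefix defaults to the column name).
features = pd.get_dummies(
    df, columns=["UNIQUE_CARRIER", "ORIGIN", "DEST"], drop_first=True
)

# Map each dummy column name to its position in the feature vector.
index_dict = dict(zip(features.columns, range(features.shape[1])))
print(index_dict)
```

Since the Flask app will not have the training frame at hand, one option is to pickle this dict alongside the model and load both at startup.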

Now, when you have a new flight:

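import numpy as np

new_vector = np.zeros(len(index_dict))  # 297 in this example
# The dummy columns carry the prefix given to get_dummies, e.g. "ORIGIN_JFK",
# so the lookup keys need the same prefix.
for key in ("ORIGIN_" + origin, "DEST_" + destination, "UNIQUE_CARRIER_" + carrier):
    try:
        new_vector[index_dict[key]] = 1
    except KeyError:
        # Value not seen in training, or the level dropped by drop_first=True:
        # leave its block as all zeros.
        pass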
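The same idea can be wrapped in a small helper for the Flask handler. This is only a sketch under stated assumptions: encode_flight and the sample index_dict are hypothetical names and values, and the result is reshaped to a single row on the assumption that the pickled model has a scikit-learn-style predict that expects a 2D array.

```python
import numpy as np

# Hypothetical mapping; in practice, build it from the training columns
# and save it next to the pickled model.
index_dict = {
    "UNIQUE_CARRIER_DL": 0, "UNIQUE_CARRIER_UA": 1,
    "ORIGIN_LAX": 2, "ORIGIN_SFO": 3,
    "DEST_LAX": 4, "DEST_SFO": 5,
}


def encode_flight(carrier, origin, destination, index_dict):
    """Return a 1 x n_features array for a single user request."""
    vector = np.zeros(len(index_dict))
    keys = (
        "UNIQUE_CARRIER_" + carrier,
        "ORIGIN_" + origin,
        "DEST_" + destination,
    )
    for key in keys:
        position = index_dict.get(key)
        # Unknown values and the dropped baseline level stay at zero.
        if position is not None:
            vector[position] = 1
    # Assumption: the model expects one row per sample (2D input).
    return vector.reshape(1, -1)


row = encode_flight("DL", "SFO", "LAX", index_dict)
print(row)
```

A value that was never seen in training is encoded the same way as the baseline level, so it may be worth validating user input against the known carriers and airports before predicting.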
