
Using sklearn's 'predict' function

If I trained a model in sklearn using dummy variables for categorical values, what is the best practice for feeding a single row of features into this model to get a prediction? I am trying to get scores for every input data set. If I have fewer columns than the data set I used to train/fit the model, will it throw an error?

Just to clarify: I took a data set that had 5 columns and created over 118 dummy columns before I built my model. Now I have a single row of data with 5 columns that I would like to pass to the predict function. How can I do this?

Any help here would be greatly appreciated.

It is a mistake to expand features based on the current contents of the table, because you can't repeat the same expansion on other data. If you want to create features this way, you should use a constructor that remembers the structure of the features. Since you gave no example data, here is the main idea of how you can write such a constructor:

import pandas as pd

data = pd.DataFrame([['Missouri', 'center', 'Jan', 55, 11],
                     ['Kansas', 'center', 'Mar', 54, 31],
                     ['Georgia', 'east', 'Jan', 37, 18]],
                     columns=('state', 'pos', 'month', 'High Temp', 'Low Temp'))


test =  pd.DataFrame([['Missouri', 'center', 'Feb', 44, 23], 
                      ['Missouri', 'center', 'Mar', 55, 33]],
                      columns=('state', 'pos', 'month', 'High Temp', 'Low Temp'))  


class DummyColumns():
    def __init__(self, data):
        # Columns constructor
        self.empty = pd.DataFrame(columns=(list(data.columns) +
                                           list(data.state.unique()) +
                                           list(data.pos.unique()) +
                                           ['Winter', 'Not winter']))
    def __call__(self, data):
        # Keep the raw columns and add every known dummy column, filled with zeros
        self.df = data.reindex(columns=self.empty.columns, fill_value=0)
        for row in data.itertuples():
            self.df.loc[row.Index, row.state] = 1
            self.df.loc[row.Index, row.pos] = 1
            if row.month in ['Dec', 'Jan', 'Feb']:
                self.df.loc[row.Index, 'Winter'] = 1
            else:
                self.df.loc[row.Index, 'Not winter'] = 1
        return self.df       

add_dummy = DummyColumns(data)
dummy_test = add_dummy(test)
print(dummy_test)

      state     pos month  High Temp  Low Temp  Missouri  Kansas  Georgia  \
0  Missouri  center   Feb         44        23         1       0        0   
1  Missouri  center   Mar         55        33         1       0        0   

   center  east  Winter  Not winter  
0       1     0       1           0  
1       1     0       0           1  
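
The same "remember the column structure" idea can also be sketched with pandas alone. This is a minimal sketch, not the method used above: it assumes the dummies are created with `pd.get_dummies`, and the names `CATEGORICAL`, `train_columns` and `encode_like_training` are made up for illustration. The training columns are stored once, and every new row is aligned to them with `reindex`.

```python
import pandas as pd

# Hypothetical choice of which columns to one-hot encode
CATEGORICAL = ['state', 'pos', 'month']

train = pd.DataFrame([['Missouri', 'center', 'Jan', 55, 11],
                      ['Kansas', 'center', 'Mar', 54, 31],
                      ['Georgia', 'east', 'Jan', 37, 18]],
                     columns=['state', 'pos', 'month', 'High Temp', 'Low Temp'])

# Encode the training data once and remember the resulting column layout
X_train = pd.get_dummies(train, columns=CATEGORICAL, dtype=int)
train_columns = X_train.columns


def encode_like_training(rows):
    """Encode new rows and align them to the columns seen at training time."""
    encoded = pd.get_dummies(rows, columns=CATEGORICAL, dtype=int)
    # Dummy columns missing from `rows` are added as zeros;
    # categories never seen in training (here 'Feb') are dropped
    return encoded.reindex(columns=train_columns, fill_value=0)


single_row = pd.DataFrame([['Missouri', 'center', 'Feb', 44, 23]],
                          columns=train.columns)
X_new = encode_like_training(single_row)
```

`X_new` now has exactly the same columns, in the same order, as `X_train`, so it can be handed to the `predict` method of whatever estimator was fitted on `X_train`. This matters because sklearn estimators generally reject input whose number of features differs from what they were fitted on.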
