
Cross-validating an OLS regression model on a dataframe

I have a dataframe like this (the real one is bigger and has more features):

      Date  Influenza[it]  Febbre[it]  Cefalea[it]  Paracetamolo[it]  tot_incidence
0  2008-01            989        2395         1291              2933           4.56
1  2008-02            962        2553         1360              2547           5.98
2  2008-03           1029        2309         1401              2735           6.54
3  2008-04           1031        2399         1137              2296           6.95
...

First I ran an OLS regression on the whole dataframe, without splitting it into training and test sets. This is the input configuration that worked (tot_incidence is the target to predict; Influenza[it], Febbre[it] and Cefalea[it] are the features):

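import statsmodels.formula.api as smf  # the formula interface provides the lowercase ols()
# Rename the columns so they are valid names inside a formula.
fin1 = fin1.rename(columns={'tot_incidence': 'A', 'Influenza[it]': 'B', 'Febbre[it]': 'C', 'Cefalea[it]': 'D'})
result = smf.ols(formula="A ~ B + C + D", data=fin1).fit()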

OK. Now I want to split the data into a training set and a test set.

I tried a classic split and K-fold.

1° Classic split

This is probably the easier option. I could do this:

from sklearn.model_selection import train_test_split  # formerly in sklearn.cross_validation
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

And then pass the variables to the OLS model:

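import statsmodels.api as sm

x_train = sm.add_constant(X_train)
x_test = sm.add_constant(X_test)  # the test set needs the same constant column
model = sm.OLS(y_train, x_train)
results = model.fit()
predictions = results.predict(x_test)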

In this case, how can I build x and y from the dataframe so that I can pass them to the train_test_split function?

2° K-fold (if it's too hard, don't waste time on it)

For example, I could do this:

from sklearn.model_selection import KFold  # formerly in sklearn.cross_validation

array = dataframe.values
X = array[:, 1:3]
Y = array[:, 5]
num_folds = 10
seed = 7
# shuffle=True is needed for random_state to have any effect
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

At this point I'm stuck: how can I feed these folds into the OLS model and then make the predictions? Is there a better way to build the training/test sets?

In this case, how can I build x and y from the dataframe so that I can pass them to the train_test_split function?

You need to convert the dataframe columns into inputs (x, y) that an algorithm can understand, i.e. convert the columns into either numbers or categories, depending on the type of algorithm you are using.

1) Select the variable in your dataframe that is your response (the quantity you want to predict), i.e. your Y variable. Say that's Influenza:
y = df.Influenza.values # convert to a numpy array

2) Select the X variables, say Febbre, Cefalea and Paracetamolo:
X = np.column_stack([df.Febbre.values, df.Cefalea.values, df.Paracetamolo.values])
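
The attribute-style access above assumes simple column names. The names in the question contain brackets (for example Febbre[it]), which are not valid Python identifiers, so bracket indexing is the safer route. A minimal sketch, assuming the target is tot_incidence as in the question and using only the four rows shown there as stand-in data:

```python
import pandas as pd

# Tiny stand-in frame built from the four rows shown in the question.
df = pd.DataFrame({
    "Influenza[it]": [989, 962, 1029, 1031],
    "Febbre[it]": [2395, 2553, 2309, 2399],
    "Cefalea[it]": [1291, 1360, 1401, 1137],
    "tot_incidence": [4.56, 5.98, 6.54, 6.95],
})

# Bracket indexing works for any column label, including ones containing "[".
y = df["tot_incidence"].to_numpy()
X = df[["Influenza[it]", "Febbre[it]", "Cefalea[it]"]].to_numpy()

print(X.shape, y.shape)
```

Selecting several columns with a list of labels returns a two-dimensional array directly, so np.column_stack is not needed in this variant.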

Now you can call the train_test_split function with X and y.

Note that if your variables are categorical, you'll have to encode them first, for example with one-hot encoding.
