sklearn LinearRegression.Predict() issue

I am trying to predict call volume for a call center based on various other factors. I have a fairly clean dataset, fairly small as well, but large enough. I am able to train and test on historical data and get a score, summary, etc. What I cannot figure out, for the life of me, is how to then get the model to predict future calls using forecasted factor data. My data is below:

Date       DayNum  factor1  factor2   factor3  factor4  factor5  factor6   factor7  factor8    factor9  VariableToPredict
9/17/2014  1       592      83686.46  0        0        250      15911.8   832      99598.26   177514   72
9/18/2014  2       1044     79030.09  0        0        203      23880.55  1238     102910.64  205064   274
9/19/2014  3       707      84207.27  0        0        180      8143.32   877      92350.59   156360   254
9/20/2014  4       707      97577.78  0        0        194      16688.95  891      114266.73  196526   208
9/21/2014  5       565      83084.57  0        0        153      13097.04  713      96181.61   143678   270

The code I have so far is below:

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import pandas as pd

d = pd.read_csv("H://My Documents//Python Scripts//RawData//Q2917.csv", "r", delimiter=",")
e = pd.read_csv("H://My Documents//Python Scripts//RawData//FY16q2917Test.csv", "r", delimiter=",")
#print(d)
#b = pd.DataFrame.as_matrix(d)
#print(b)
x = d.as_matrix(['factor2', 'factor4', 'factor5', 'factor6'])    
y = d.as_matrix(['VariableToPredict'])
x1 = e.as_matrix(['factor2', 'factor4', 'factor5', 'factor6'])
y1 = e.as_matrix(['VariableToPredict'])
#print(len(train))
#print(target)
#use scaler
scalerX = StandardScaler()
train = scalerX.fit_transform(x1)
scalerY = StandardScaler()
target = scalerY.fit_transform(y1)

clf = LinearRegression(fit_intercept=True)
cv = KFold(len(train), 10, shuffle=True, random_state=33)


#decf = LinearRegression.decision_function(train, target)
test = LinearRegression.predict(train, target)
score = cross_val_score(clf,train, target,cv=cv )

print("Score: {}".format(score.mean()))

This of course gives me an error saying there are nulls in the y values, which there are: the column is blank because it is what I am trying to predict. The problem is that I am new enough to Python that I am fundamentally misunderstanding how this should be built. Even if it worked this way, it wouldn't be correct, because it doesn't take the past data into account when building the model to predict the future. Do I possibly need to have these in the same file? If so, how do I tell it to consider these 3 columns from row a to row b, predict the dependent column for the same rows, then apply that model to those three columns in the future data and predict the future calls? I don't expect the whole answer here, since this is my job to do, but any small clues would be greatly appreciated.

In order to build a regression model, you need training data and training scores (the known target values). These allow you to fit a set of regression parameters to the problem.

Then, to predict, you need prediction data but NOT prediction scores, because you don't have these: you're trying to predict them!

The code below, for example, will run:

from sklearn.linear_model import LinearRegression
import numpy as np

trainingData = np.array([ [2.3,4.3,2.5], [1.3,5.2,5.2], [3.3,2.9,0.8], [3.1,4.3,4.0]  ])
trainingScores = np.array([3.4,7.5,4.5,1.6])

clf = LinearRegression(fit_intercept=True)
clf.fit(trainingData,trainingScores)

predictionData = np.array([ [2.5,2.4,2.7], [2.7,3.2,1.2] ])
clf.predict(predictionData)

It looks as though you're putting the wrong arguments into your predict() call: it is called on the LinearRegression class rather than on a fitted model, and it is given the target values, whereas predict() takes only the data to predict on. Have a look at my snippet here and you should be able to work out how to change it.
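
As a rough sketch of how the same pattern might carry over to the question's data: the historical rows below are the four factor columns and call volumes from the table in the question, while the two "future" rows are made-up values standing in for forecasted factors. Wrapping the scaler and the regression in a pipeline is one possible design, not the only one; it keeps the scaling learned from the historical rows and reuses it when predicting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Historical rows from the question: factor2, factor4, factor5, factor6.
X_hist = np.array([
    [83686.46, 0, 250, 15911.80],
    [79030.09, 0, 203, 23880.55],
    [84207.27, 0, 180, 8143.32],
    [97577.78, 0, 194, 16688.95],
    [83084.57, 0, 153, 13097.04],
])
# Known call volumes (VariableToPredict) for those same rows.
y_hist = np.array([72, 274, 254, 208, 270])

# Hypothetical forecasted factors for two future days (made-up values).
X_future = np.array([
    [85000.00, 0, 200, 15000.00],
    [90000.00, 0, 210, 18000.00],
])

# The pipeline learns the scaling from the historical rows only and
# reuses it unchanged when transforming the future rows.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_hist, y_hist)

# predict() receives features only; there is no y for the future rows.
future_calls = model.predict(X_future)
print(future_calls)
```

With only five rows the fitted numbers mean very little; the point is the shape of the workflow: fit on rows where the target is known, then call predict() on rows where it is not.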

Just for interest, you can run the following line afterwards to get access to the parameters that the regression fitted to the data: print(repr(clf.coef_))
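
To make those parameters a little more concrete, here is a small sketch reusing the arrays from the snippet above. It checks that predict() gives the same numbers as applying the fitted coefficients and intercept by hand; the variable name manual is just for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

clf = LinearRegression(fit_intercept=True).fit(trainingData, trainingScores)

# One coefficient per feature column, plus a single intercept.
print(clf.coef_, clf.intercept_)

# For a linear regression, predict() is the formula X @ coef_ + intercept_.
manual = predictionData @ clf.coef_ + clf.intercept_
print(np.allclose(manual, clf.predict(predictionData)))
```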
