
SciKit-learn for data-driven regression of oscillating data

Long-time lurker, first-time poster.

I have data that roughly follows a y = sin(time) distribution, but also depends on variables other than time. In terms of correlations, since the target variable y oscillates, there is almost zero statistical correlation with time, even though y obviously depends very strongly on time.
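The near-zero correlation described above can be reproduced on synthetic data. The following is a minimal sketch, assuming a pure sine wave as a stand-in for the real series; the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the data: ten full periods of a sine wave.
t = np.linspace(0.0, 20.0 * np.pi, 2000)
y = np.sin(t)

# Pearson correlation between raw time and the oscillating target.
r_time = np.corrcoef(t, y)[0, 1]

# Correlation with a feature that encodes the phase within one cycle.
r_phase = np.corrcoef(np.sin(t % (2.0 * np.pi)), y)[0, 1]

print(f"corr(t, y)          = {r_time:.3f}")
print(f"corr(sin(phase), y) = {r_phase:.3f}")
```

The linear correlation with raw time is small even though y is a deterministic function of time, while a feature built from the phase of the cycle correlates almost perfectly.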

The goal is to predict future values of the target variable. I want to avoid making explicit assumptions about the model and instead rely on data-driven models and machine learning, so I have tried the regression methods in sklearn.

I have tried the following methods (the parameters were blindly copied from examples and other threads):

LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
                   param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                               "gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
                  param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                              "gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
                                n_estimators=250, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)

The results fall into two different categories of failure:

  1. The time field has no effect, probably because the oscillatory behaviour of the target variable leaves no correlation with time. However, secondary effects from other variables allow a modest predictive capability for future time ranges (these other variables have a simple correlation with the target variable).
  2. When applying predict() to the training time range, the prediction is near perfect with respect to the observations, but when given the future time range (for which data was not used in training), the predicted value stays constant.

Below is how I performed the training and testing:

import datetime
import numpy as np
import pandas as pd

# weather_df is the raw weather table, loaded elsewhere;
# model is one of the estimators listed above.
weather_df.index = pd.to_datetime(weather_df.index, unit='D')
weather_df['Days'] = (weather_df.index - datetime.datetime(2005, 1, 1)).days
ts = pd.DataFrame({'Temperature': weather_df['Mean TemperatureC'].loc[:'2015-1-1'],
                   'Humidity': weather_df[' Mean Humidity'].loc[:'2015-1-1'],
                   'Visibility': weather_df[' Mean VisibilityKm'].loc[:'2015-1-1'],
                   'Wind': weather_df[' Mean Wind SpeedKm/h'].loc[:'2015-1-1'],
                   'Time': weather_df['Days'].loc[:'2015-1-1']
                   })
start_test = datetime.datetime(2012, 1, 1)
ts_train = ts[ts.index < start_test]
ts_test = ts  # predict over the full range, including the training period
# Feature matrices of shape (n_samples, 2): humidity and day count.
data_train = np.column_stack([ts_train.Humidity, ts_train.Time])
data_target = np.array(ts_train.Temperature)
model.fit(data_train, data_target)
data_test = np.column_stack([ts_test.Humidity, ts_test.Time])
pred = model.predict(data_test)
ts_test['Pred'] = pred

Is there a regression model I could or should use for this problem, and if so, what would be appropriate options and parameters?

(Also, my treatment of the time objects in sklearn is far from elegant, so I will gladly take advice there.)

Here is my guess about what is happening in your two types of results:

.days does not convert your index into a form that repeats between your train and test samples, so it becomes a unique value for every date in your dataset.

As a consequence, your models either ignore days (1st result) or overfit on the days feature (2nd result), which causes them to perform badly on your test data.

Suggestion:

If your dataset is large enough (it looks like it starts in 2005), try using dayofyear or weekofyear instead, so that your model has something generalizable to learn from the date information.
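A minimal sketch of this suggestion, assuming a synthetic daily temperature series in place of weather_df and a random forest as the model (both are illustrative choices, not the original poster's setup):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic daily series standing in for weather_df (an assumption).
idx = pd.date_range("2005-01-01", "2014-12-31", freq="D")
doy = idx.dayofyear.to_numpy()
temp = 10.0 + 10.0 * np.sin(2.0 * np.pi * (doy - 110) / 365.25)
df = pd.DataFrame({"Temperature": temp}, index=idx)

# Day of year repeats every year, so test values were seen in training.
df["DayOfYear"] = df.index.dayofyear

train = df[df.index < "2012-01-01"]
test = df[df.index >= "2012-01-01"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train[["DayOfYear"]], train["Temperature"])
pred = model.predict(test[["DayOfYear"]])

r2_test = model.score(test[["DayOfYear"]], test["Temperature"])
print("test R^2:", r2_test)
```

Because the feature now takes the same values in the training and test periods, the forecast follows the seasonal cycle instead of flattening out. Real data would also carry noise and year-to-year drift that this toy series omits.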

I agree with @zemekeneng that time should be taken modulo the corresponding period, such as 24 hours or 12 months.
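One way to apply the modulo idea is to wrap the day counter into a phase and encode it with a sine and cosine pair. This is a minimal sketch, assuming a yearly period of 365.25 days and a synthetic signal; the sine/cosine encoding is an addition for illustration and is not part of the answers in this thread.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.arange(0, 3650, dtype=float)   # days since the start date
period = 365.25                          # assumed yearly period
y = 12.0 + 8.0 * np.sin(2.0 * np.pi * days / period + 0.7)

# Wrap time into one period, then encode the phase as two features.
phase = 2.0 * np.pi * (days % period) / period
X = np.column_stack([np.sin(phase), np.cos(phase)])

train = days < 2555                      # roughly the first seven years
lin = LinearRegression().fit(X[train], y[train])

r2_future = lin.score(X[~train], y[~train])
print("future R^2:", r2_future)
```

Using both sine and cosine lets a plain linear model absorb an unknown phase offset, since sin(p + c) is a linear combination of sin(p) and cos(p).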

Beyond that, I'd like to remind you to use prior knowledge when selecting features or models. Since you already know that your data is highly likely to follow sin(x), that knowledge should be used even in a data-driven approach.

We know that sin(x) can be approximated by x - x^3/3! + x^5/5! - x^7/7!, so these terms should be used as features. None of the models that you used would have included these features. One way to do it would be to create these high-order features yourself and concatenate them to your other features. Then a linear model with regularization may give you reasonable results.
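A minimal sketch of the polynomial-feature idea with a regularized linear model follows. It rests on several assumptions not stated in the thread: the signal is synthetic with no phase offset, the period is 365.25 days, and time is first wrapped and rescaled to [-1, 1), because a polynomial in the raw day counter grows without bound and the truncated series is only accurate near zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

days = np.arange(0, 3650, dtype=float)
period = 365.25                          # assumed yearly period
y = np.sin(2.0 * np.pi * days / period)

# Wrap time into one period and rescale to u in [-1, 1).
u = 2.0 * (days % period) / period - 1.0

# Odd powers, mirroring the terms of the sine series.
X = np.column_stack([u, u**3, u**5, u**7])

train = days < 2555
ridge = Ridge(alpha=1e-3).fit(X[train], y[train])

r2_future = ridge.score(X[~train], y[~train])
print("future R^2:", r2_future)
```

The model learns the coefficients from the data rather than fixing them to the Taylor values. If the real signal has a phase offset, even powers of u would need to be added as well.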
