
Using Scikit-learn Pipelines, how can features be generated from time series data when the features depend on other rows?

Original post: 2018-03-13 14:59:46 | Tags: python, scikit-learn, time-series, pipeline

Here is a concrete example scenario:

I want to train a classifier to predict whether a given stock will go up or down in price the next day.

Here is how I want to do that:

I have the daily close price of a particular stock over a given period of time. I want to generate features including the exponentially weighted moving average and the price rate of change. These calculations for a given day require the close prices of previous days. Then I want to calculate a target variable, 1 or -1, indicating whether the stock price went up or down the next day.
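The feature and target calculations described above can be sketched with pandas as follows. This is a minimal illustration, not code from the original post: the function name make_features_and_target, the parameters span and roc_days, and the sample prices are all assumptions, and a day whose close equals the next day's close is arbitrarily labeled -1.

```python
import numpy as np
import pandas as pd


def make_features_and_target(close, span=10, roc_days=5):
    """Hypothetical helper: EWMA and rate-of-change features plus a next-day target."""
    out = pd.DataFrame(index=close.index)
    # Exponentially weighted moving average; each value uses only that day
    # and earlier closes.
    out["ewma"] = close.ewm(span=span, adjust=False).mean()
    # Price rate of change relative to the close roc_days earlier.
    out["roc"] = close / close.shift(roc_days) - 1
    # Target: 1 if the next day's close is higher, otherwise -1.
    next_close = close.shift(-1)
    out["target"] = np.where(next_close > close, 1, -1)
    # Drop rows that lack enough history or have no next day.
    return out[out["roc"].notna() & next_close.notna()]


# Made-up closing prices, for illustration only.
close = pd.Series(
    [10.0, 11.0, 12.0, 11.0, 13.0, 12.0],
    index=pd.date_range("2018-01-01", periods=6, freq="D"),
    name="close",
)
data = make_features_and_target(close, span=3, roc_days=2)
print(data)
```

Note that the first roc_days rows and the last row are dropped, because they have no prior close to compare against or no next-day price to label.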

After generating the features and target, I want to split the data into train/test (or even train/validation/test) groups, then train and test a classifier to predict the target.

Finally, I want to implement and execute these steps in an sklearn Pipeline for two main reasons: (1) to easily manipulate the data flow and/or try different classifiers, and (2) to run a grid search to find good parameters for both the feature generation steps and the classifier. For example, how many days should be taken into account when calculating the exponentially weighted moving average, or how many estimators should the random forest classifier use?

Here is the issue I run into:

From what I've read about sklearn Pipelines, I would need to create custom transformers and perhaps use FeatureUnion to generate the features. However, the examples I've seen all call .fit(X_train, y_train), which runs each step in the Pipeline (including generating the features). But my features depend on other rows, which may not be in X_train.

1 answer

Have a look at the documentation for tsfresh. They even have an example of what you are looking for: using time series transforms with sklearn pipelines.
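One way to handle the row-dependency problem from the question can also be sketched with plain pandas and scikit-learn, without tsfresh. In the sketch below, the transformer is handed the full price series when it is constructed, and the X passed through the Pipeline is used only for its dates, so a row's features can draw on earlier rows that are not in the training fold. The class name CloseFeatures, its parameters, and the random-walk prices are hypothetical and not taken from the original post or from tsfresh. Because both features use only the current and earlier closes, computing them on the full series should not leak future prices into a row.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline


class CloseFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: computes features on the full series, then
    returns only the rows whose dates appear in X."""

    def __init__(self, prices=None, span=10, roc_days=5):
        self.prices = prices      # full close-price Series indexed by date
        self.span = span          # EWMA span, tunable by grid search
        self.roc_days = roc_days  # look-back for the rate of change

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the training fold

    def transform(self, X):
        p = self.prices
        feats = pd.DataFrame({
            "ewma": p.ewm(span=self.span, adjust=False).mean(),
            "roc": p / p.shift(self.roc_days) - 1,
        })
        # Only the index of X is used; history comes from the full series.
        return feats.loc[X.index].to_numpy()


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=300, freq="D")
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum(), index=dates)

# Keep rows with enough history for the features and a next day for the target.
usable = close.index[10:-1]
X = close.loc[usable].to_frame("close")
y = np.where(close.shift(-1).loc[usable] > close.loc[usable], 1, -1)

pipe = Pipeline([
    ("features", CloseFeatures(prices=close)),
    ("clf", RandomForestClassifier(random_state=0)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"features__span": [5, 20], "clf__n_estimators": [10, 50]},
    cv=TimeSeriesSplit(n_splits=3),  # training folds always precede test folds
)
grid.fit(X, y)
print(grid.best_params_)
```

TimeSeriesSplit is used here instead of a shuffled split so that each validation fold lies strictly after its training fold, which matches how such a classifier would be used in practice.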