
Classifying patterns in time series

Posted 2020-10-18 02:53:50. Tags: machine-learning, time-series, classification, logistic-regression, pattern-recognition

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks, as shown in the image below.

The patterns do not have a fixed length in samples, but they stay within an approximate range, say 500 samples ±10%. The heights of the peaks can change. The random signal (I call it random, but it simply means anything that does not follow the pattern shape) can also change in value.

The data is from a sensor. The patterns appear when the device is working smoothly. If the device is malfunctioning, I will not see the patterns and will get something similar to the class 0 shown in the image.

What I have done so far is build a logistic regression model. Here are my steps for data preparation:

1. Grab the data between every two consecutive peaks, resample it to a fixed size of 100 samples, and scale it to [0, 1]. This is class 1.
2. Repeat step 1 on the data between valleys and call it class 0.
3. Generate some noise and repeat step 1 on chunks of 500 samples to build extra class 0 data.
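
The preparation steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the asker's actual code: the function names (`resample`, `scale01`, `prepare`) and the `cut_points` argument (indices of already-detected peaks or valleys) are hypothetical, and linear interpolation is an assumption, since the post does not say how the resampling was done.

```python
def resample(segment, n_out=100):
    """Linearly interpolate `segment` onto `n_out` evenly spaced points (n_out >= 2)."""
    n_in = len(segment)
    if n_in == 1:
        return [float(segment[0])] * n_out
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional index into the segment
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[hi] * frac)
    return out


def scale01(values):
    """Min-max scale to [0, 1]; a completely flat segment maps to all zeros."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        return [0.0] * len(values)
    return [(v - vmin) / span for v in values]


def prepare(signal, cut_points, n_out=100):
    """One fixed-length, scaled row per stretch between consecutive cut points."""
    rows = []
    for start, end in zip(cut_points, cut_points[1:]):
        segment = signal[start:end + 1]  # include both end points
        rows.append(scale01(resample(segment, n_out)))
    return rows


# Toy signal (assumption): a triangle wave with peaks every 50 samples.
signal = [abs((i % 50) - 25) for i in range(151)]
peaks = [0, 50, 100, 150]
rows = prepare(signal, peaks)
print(len(rows), len(rows[0]))  # 3 100
```

Passing peak indices as `cut_points` gives class 1 rows; passing valley indices gives the class 0 rows from step 2. With NumPy available, `numpy.interp` would do the same interpolation more compactly.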

The bottom figure shows my predictions on the test dataset. Prediction on the noise chunk is not great. I am worried that on the real data I may get even more false positives. Any ideas on how I can improve my predictions? Is there a better approach when no class 0 data is available?

I have seen a similar question here. My understanding of Hidden Markov Models is limited, but I believe they are used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.

2 Answers

I have some proposals that you could try out. First, I think recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard that some people work with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).

So if you don't want to dive into neural networks, which is probably not necessary because your signals seem to be relatively easy to distinguish, you can give LightGBM (or other tree ensemble methods) a chance.

If you know the maximum length of a positive sample, you can define the length of your "sliding sample window", which becomes your input vector (so each sample in the sliding window becomes one input feature). I would then add an extra attribute holding the number of samples since the last peak occurred (outside/before the sample window). Next, you can choose the step with which your window slides over the data; this also depends on the memory you have available. It may be wise to skip some of the windows around a change between positive and negative, because those states might not be classifiable unambiguously.
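
As a rough illustration of the sliding-window idea, the sketch below builds one feature row per window position and appends the extra "samples since the last peak" attribute. The function name `window_features`, the `-1` placeholder used when no peak has occurred yet, and the default `window` and `step` values are all assumptions made for this example; peak indices are taken as already known.

```python
import bisect


def window_features(signal, peak_indices, window=500, step=50):
    """Return one row per window: `window` raw samples plus one extra attribute.

    The extra attribute is the distance (in samples) from the last peak that
    occurred before the window to the window start, or -1 if there is none.
    """
    peaks = sorted(peak_indices)
    rows = []
    for start in range(0, len(signal) - window + 1, step):
        n_before = bisect.bisect_left(peaks, start)  # peaks strictly before start
        since_peak = start - peaks[n_before - 1] if n_before > 0 else -1
        rows.append(list(signal[start:start + window]) + [since_peak])
    return rows


# Toy example (assumption): 30 samples, peaks at indices 3 and 12.
rows = window_features(list(range(30)), [3, 12], window=10, step=5)
print([row[-1] for row in rows])  # [-1, 2, 7, 3, 8]
```

Each row could then be fed to a tree-based model as a plain feature vector. A larger `step` produces fewer rows and therefore uses less memory, at the cost of coarser coverage.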

In case memory becomes an issue, neural networks could be the better choice, because they do not need all the training data available at once during training, so you can generate your input data in batches. With tree-based methods this possibility does not exist, or exists only in a very limited way.
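
Generating input data in batches could look something like the generator below. It is only a sketch of the lazy-batching idea; the name `window_batches` and its default parameters are hypothetical, and how the batches are passed to a training loop depends on the framework used.

```python
def window_batches(signal, window=500, step=50, batch_size=32):
    """Yield lists of at most `batch_size` windows, built lazily from `signal`."""
    batch = []
    for start in range(0, len(signal) - window + 1, step):
        batch.append(signal[start:start + window])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Toy example (assumption): 100 samples, non-overlapping windows of 10.
sizes = [len(b) for b in window_batches(list(range(100)), window=10, step=10, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Only one batch of windows exists in memory at a time, which is what makes this approach attractive when the full set of overlapping windows would not fit.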

I'm not sure what you are trying to achieve.

If you want to characterize what is a peak and what is not (an after-the-fact classification), then you can use a simple rule to define peaks, such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N the number of data points to look back over.

This would determine what is a peak (class 1) and what is not (class 0), and hence classifies the patterns.
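
The threshold rule can be written down in a few lines. In this sketch the average is taken over the `n_back` samples strictly before `t` (an assumption; the rule as stated could also include `t` itself), and the first `n_back` samples are left as class 0 because they have no full look-back window. The name `flag_peaks` is hypothetical.

```python
def flag_peaks(signal, n_back, threshold):
    """Label t as 1 when signal[t] exceeds the mean of the previous
    `n_back` samples by more than `threshold`, else 0."""
    labels = [0] * len(signal)
    for t in range(n_back, len(signal)):
        baseline = sum(signal[t - n_back:t]) / n_back
        if signal[t] - baseline > threshold:
            labels[t] = 1
    return labels


# Toy example (assumption): two spikes on a flat baseline.
labels = flag_peaks([0, 0, 0, 0, 5, 0, 0, 0, 0, 6], n_back=3, threshold=2)
print(labels)  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```

Both `n_back` and `threshold` would need tuning on real sensor data, for example by checking the flagged positions against a plot of the signal.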

If your goal is to predict that a peak is going to happen a few time units before the peak (at time t), using, say, data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.

To find the right model, you have to start by visualizing the features you have from t-n1 to t-n2 for every peak(t) and see if there is any pattern you can find. It can be anything:

Was there a peak in the n3 days before t?
Is there a trend?
Was there an outlier (transform your data into exponential)?

In order to compare these patterns, think of normalizing them so that, for example, the n2-n1 data points go from 0 to 1.
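
To make the comparison concrete, the sketch below cuts out the stretch from t-n1 to t-n2 before a peak, scales it to [0, 1], and computes a least-squares slope as one possible "trend" feature. All function names are hypothetical, it assumes n1 > n2 >= 0 and t - n1 >= 0, and the slope is just one of many features that could be inspected.

```python
def pre_peak_window(signal, t, n1, n2):
    """Samples from index t-n1 up to (not including) t-n2."""
    return signal[t - n1:t - n2]


def normalise(values):
    """Min-max scale to [0, 1]; a flat window maps to all zeros."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        return [0.0] * len(values)
    return [(v - vmin) / span for v in values]


def slope(values):
    """Least-squares slope of `values` against their index (needs >= 2 points)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Toy example (assumption): a steadily rising signal with a "peak" at t = 15.
signal = list(range(20))
window = pre_peak_window(signal, t=15, n1=10, n2=2)
scaled = normalise(window)
print(len(window), scaled[0], scaled[-1])  # 8 0.0 1.0
```

Plotting the scaled windows for all peaks on top of each other is one way to check visually whether they share a common shape.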

If you find a pattern visually, then you will know what kind of model is likely to work, and on which features.

If you don't, then it's likely that the white noise you added will be just as good, so you might not find a good prediction model.

However, your bottom graph is not so bad; you have only 2 major false positives out of >15 predictions. This suggests that better feature engineering is the way forward.