
Unsupervised clustering algorithm for unevenly spaced sequential categorical data?

Original post 2023-01-29 15:17:51 0 1 python/ r/ statistics/ time-series/ cluster-analysis

I am looking for a technique, method, or algorithm that can handle time-dependent data. Each sample has 20 time steps, but for the most part they are spaced unevenly between samples, i.e., one sample may have a value at 0.4 seconds while another might not. The value at each time step is a categorical position on the body (ranging from 1 to 20) where the muscle activation occurred. So the data resembles (time, position): (0.1, 16) (0.16, 1) (0.25, 13) (0.26, 12) (0.27, 1) (0.4, 4)

Is there a clustering algorithm that works for this type of data? I would like the algorithm to consider the time dependency of the data. Dynamic time warping is not suitable for unevenly spaced time series data, and I am not sure how it would handle the sparse categorical data I have, e.g., a given position will only appear once per sample.

Any suggestions or help is appreciated.

I have looked through lots of different models, but so far none work given their assumptions. Hidden Markov models are out of the question (need stochastic time steps), DTW does not work for unevenly spaced time steps, and techniques like Lomb-Scargle do not work for categorical data, especially non-periodic categorical data. The fast Fourier transform is also off the table.

1 Answer

One method you can use for clustering this type of time-dependent data is a Hidden Markov Model (HMM). HMMs can model the dependencies between the positions and the time steps, allowing similar patterns in the data to be clustered. Another option is a Gaussian Mixture Model (GMM), where you model the position and time values as multivariate Gaussian distributions and use Expectation-Maximization (EM) to estimate the parameters of the distributions. Both HMMs and GMMs have been used in various time-series analysis and clustering tasks, and both have Python implementations available through popular libraries such as scikit-learn (GMMs) and hmmlearn (HMMs).
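As a rough sketch of the GMM suggestion, the snippet below assumes each sample has first been encoded as a 20-dimensional vector whose entry for position p is the time at which p fired. That encoding is only one possible reading, not something the answer specifies. The data is synthetic, and the helper name `cluster_samples` is hypothetical; only `GaussianMixture` and its `fit_predict` method come from scikit-learn.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for real recordings (an assumption for illustration).
# Column p-1 of X holds the time at which position p fired in that sample.
base_order_a = np.linspace(0.05, 1.0, 20)  # positions fire in order 1, 2, ..., 20
base_order_b = base_order_a[::-1]          # positions fire in order 20, 19, ..., 1
X = np.vstack([
    base_order_a + rng.normal(0.0, 0.02, size=(30, 20)),
    base_order_b + rng.normal(0.0, 0.02, size=(30, 20)),
])


def cluster_samples(X, n_clusters, seed=0):
    """Fit a diagonal-covariance GMM with EM and return one label per sample."""
    gmm = GaussianMixture(
        n_components=n_clusters,
        covariance_type="diag",  # fewer parameters than "full" for small datasets
        random_state=seed,
    )
    return gmm.fit_predict(X)


labels = cluster_samples(X, n_clusters=2)
```

Here `labels` holds one cluster index per row of `X`. On real data the number of components is unknown, so it would probably need to be chosen by comparing several fits, for instance with the model's `bic` method.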

It is worth trying both algorithms and comparing the results to see which performs better on your specific dataset. You can also experiment with different features and preprocessing techniques, such as interpolation or downsampling, to see whether they improve the clustering.
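To make the feature and preprocessing idea concrete, here is a minimal sketch of two encodings for the (time, position) events from the question. Both function names are hypothetical, and so are the choices they make: keeping only the first activation of a repeated position, marking absent positions with NaN, and using 0 for "nothing has fired yet" on the grid.

```python
import math


def activation_time_vector(events, n_positions=20):
    """Return a list whose entry p-1 is the first time position p fired.

    Positions that never fire stay NaN. Keeping only the first activation
    of a repeated position is an assumption made for this sketch.
    """
    vector = [math.nan] * n_positions
    for time, position in sorted(events):
        if math.isnan(vector[position - 1]):
            vector[position - 1] = time
    return vector


def forward_fill_on_grid(events, step, n_steps):
    """Resample events onto an evenly spaced grid of n_steps points.

    Each grid point holds the most recently activated position at that
    time, or 0 if no position has fired yet.
    """
    ordered = sorted(events)
    grid = []
    current = 0
    i = 0
    for k in range(n_steps):
        t = k * step
        # Consume every event that happened at or before this grid time.
        while i < len(ordered) and ordered[i][0] <= t:
            current = ordered[i][1]
            i += 1
        grid.append(current)
    return grid


# The example sample from the question.
events = [(0.1, 16), (0.16, 1), (0.25, 13), (0.26, 12), (0.27, 1), (0.4, 4)]
features = activation_time_vector(events)
grid = forward_fill_on_grid(events, step=0.1, n_steps=5)
```

The first encoding gives every sample the same length regardless of when its events occur, which is what most clustering routines expect, although NaN entries would have to be imputed or dropped first. The second trades time resolution for an evenly spaced sequence, so events that fall between grid points can be overwritten by later ones.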