Fit mixture of Gaussians with fixed covariance in Python

I have some 2D data (GPS data) with clusters (stop locations) that I know resemble Gaussians with a characteristic standard deviation (proportional to the inherent noise of GPS samples). The figure below visualizes a sample that I expect has two such clusters. The image is 25 meters wide and 13 meters tall.

[Figure: the GPS sample, 25 meters wide and 13 meters tall]

The sklearn module provides sklearn.mixture.GaussianMixture, which lets you fit a mixture of Gaussians to data. It has a parameter, covariance_type, that lets you make different assumptions about the shape of the Gaussians. You can, for example, assume that all components share the same covariance matrix by passing 'tied'.

However, there appears to be no direct way to hold the covariance matrices constant. From the sklearn source code it seems trivial to make a modification that enables this, but it feels a bit excessive to open a pull request for it (and I don't want to accidentally introduce bugs into sklearn). Is there a better way to fit a mixture to data where the covariance matrix of each Gaussian is fixed?

I want to assume that the SD stays constant at around 3 meters for each component, since that is roughly the noise level of my GPS samples.

I think the best option would be to "roll your own" GMM by defining a new class that inherits from GaussianMixture and overrides the methods needed to get the behavior you want. This way you have your own implementation and you don't have to change the scikit-learn code (or create a pull request).
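A minimal sketch of that subclassing idea, assuming a spherical covariance (one variance per component). The class name FixedVarianceGMM and the fixed_variance parameter are made up for illustration. The sketch overrides _m_step and writes to precisions_cholesky_, which are private scikit-learn internals and may change between versions, so treat it as a starting point and not a supported API.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


class FixedVarianceGMM(GaussianMixture):
    """Hypothetical subclass: spherical GMM whose variance is pinned, not fitted."""

    def __init__(self, n_components=1, fixed_variance=1.0, max_iter=100,
                 tol=1e-3, random_state=None):
        super().__init__(
            n_components=n_components,
            covariance_type="spherical",
            # start EM from the known precision (1 / variance) as well
            precisions_init=np.full(n_components, 1.0 / fixed_variance),
            max_iter=max_iter,
            tol=tol,
            random_state=random_state,
        )
        self.fixed_variance = fixed_variance

    def _m_step(self, X, log_resp, **kwargs):
        # private hook (assumption: signature may differ across versions);
        # let scikit-learn update weights_ and means_ as usual ...
        super()._m_step(X, log_resp, **kwargs)
        # ... then overwrite the covariance estimate with the known value
        self.covariances_ = np.full(self.n_components, self.fixed_variance)
        self.precisions_cholesky_ = 1.0 / np.sqrt(self.covariances_)


# synthetic stand-in for the GPS data: two clusters with SD 3
rng = np.random.default_rng(0)
X = rng.normal(scale=3.0, size=(200, 2))
X[100:] += (10, 5)

gmm = FixedVarianceGMM(n_components=2, fixed_variance=3.0 ** 2, random_state=0).fit(X)
print(gmm.means_)
print(gmm.covariances_)  # stays at the fixed variance
```

Note that the mixture weights are still estimated here; only the variance is held constant.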

Another option that might work is the Bayesian version of GMM in scikit-learn. You might be able to set the prior for the covariance matrix so that the covariance is fixed. It seems to use the Wishart distribution as a prior for the covariance. However, I'm not familiar enough with this distribution to help you further.
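A rough sketch of the prior-based idea, using BayesianGaussianMixture with a spherical covariance. It does not fix the covariance exactly; a large degrees_of_freedom_prior just makes the prior stiff enough that the data barely moves it. The scaling covariance_prior = nu * sigma**2 is an assumption based on how scikit-learn appears to normalize the posterior scale by the degrees of freedom, so check covariances_ on your own version. The values of sigma and nu are illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

sigma = 3.0  # assumed GPS noise SD in meters
nu = 1e4     # prior degrees of freedom; larger means a stiffer prior (assumed value)

# synthetic stand-in for the GPS data: two clusters with SD 3
rng = np.random.default_rng(0)
X = rng.normal(scale=sigma, size=(200, 2))
X[100:] += (10, 5)

bgm = BayesianGaussianMixture(
    n_components=2,
    covariance_type="spherical",
    degrees_of_freedom_prior=nu,
    # assumption: the fitted variance is roughly covariance_prior / nu
    covariance_prior=nu * sigma ** 2,
    random_state=0,
).fit(X)

print(bgm.means_)
print(bgm.covariances_)  # should sit close to sigma ** 2
```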

It is simple enough to write your own implementation of the EM algorithm. It would also give you a good intuition for the process. I assume that the covariance is known and that the prior probabilities of the components are equal, and fit only the means.
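Under those assumptions the two EM steps reduce to the updates below. The notation is mine, not from any library: x_n are the N samples, mu_k the K component means, Sigma the known covariance, and r_{nk} the responsibility of component k for sample n.

```latex
% E-step: responsibility of component k for sample x_n (the equal priors cancel)
\[
r_{nk} = \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma)}
              {\sum_{j=1}^{K} \mathcal{N}(x_n \mid \mu_j, \Sigma)}
\]

% M-step: each mean becomes the responsibility-weighted average of the samples
\[
\mu_k \leftarrow \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}
\]
```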

The class would look like this (in Python 3):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

class FixedCovMixture:
    """ Gaussian mixture with a known, shared covariance matrix; only the means are fitted. """
    def __init__(self, n_components, cov, max_iter=100, random_state=None, tol=1e-10):
        self.n_components = n_components
        self.cov = cov
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol=tol

    def fit(self, X):
        # initialize the process:
        np.random.seed(self.random_state)
        n_obs, n_features = X.shape
        self.mean_ = X[np.random.choice(n_obs, size=self.n_components)]
        # make EM loop until convergence
        i = 0
        for i in range(self.max_iter):
            new_centers = self.updated_centers(X)
            if np.sum(np.abs(new_centers-self.mean_)) < self.tol:
                break
            else:
                self.mean_ = new_centers
        self.n_iter_ = i

    def updated_centers(self, X):
        """ A single iteration """
        # E-step: estimate probability of each cluster given cluster centers
        cluster_posterior = self.predict_proba(X)
        # M-step: update cluster centers as weighted average of observations
        weights = (cluster_posterior.T / cluster_posterior.sum(axis=1)).T
        new_centers = np.dot(weights, X)
        return new_centers


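    def predict_proba(self, X):
        """ Posterior of each cluster for each sample, shape (n_components, n_obs) """
        likelihood = np.stack([multivariate_normal.pdf(X, mean=center, cov=self.cov)
                               for center in self.mean_])
        # equal priors cancel, so the posterior is the normalized likelihood
        cluster_posterior = (likelihood / likelihood.sum(axis=0))
        return cluster_posterior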

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=0)

On data like yours, the model converges quickly:

np.random.seed(1)
X = np.random.normal(size=(100,2), scale=3)
X[50:] += (10, 5)

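# note: cov entries are variances, so an SD of 3 would correspond to 9 on the diagonal; the run shown here uses 3
model = FixedCovMixture(2, cov=[[3,0],[0,3]], random_state=1)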
model.fit(X)
print(model.n_iter_, 'iterations')
print(model.mean_)

plt.scatter(X[:,0], X[:,1], s=10, c=model.predict(X))
plt.scatter(model.mean_[:,0], model.mean_[:,1], s=100, c='k')
plt.axis('equal')
plt.show()

and the output:

11 iterations
[[9.92301067 4.62282807]
 [0.09413883 0.03527411]]

You can see that the estimated centers, (9.9, 4.6) and (0.09, 0.03), are close to the true centers, (10, 5) and (0, 0).

[Figure: scatter plot of the sample colored by predicted cluster, with the estimated centers in black]
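One practical caveat with an E-step built on multivariate_normal.pdf: for a sample far from every center, all densities underflow to zero and the normalization becomes 0/0. A possible variant, sketched here with a made-up helper name stable_posterior, works with log-densities and a softmax instead. It returns the same (n_components, n_obs) layout as predict_proba above.

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import multivariate_normal


def stable_posterior(X, means, cov):
    """Posterior of each component per sample, shape (n_components, n_obs)."""
    X = np.atleast_2d(X)
    # log-density of every sample under every component (equal priors assumed)
    log_lik = np.stack([
        np.atleast_1d(multivariate_normal.logpdf(X, mean=m, cov=cov))
        for m in means
    ])
    # softmax over components normalizes without leaving log space too early
    return softmax(log_lik, axis=0)


means = np.array([[0.0, 0.0], [10.0, 5.0]])
cov = 3.0 * np.eye(2)
# the first point is far from both centers, the second is next to the first center
points = np.array([[200.0, 200.0], [1.0, 1.0]])
post = stable_posterior(points, means, cov)
print(post)
```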

First, you can use the spherical option, which gives you a single variance value for each component. This lets you check yourself: if the fitted variance values differ too much, something went wrong.
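A small sketch of that sanity check on synthetic data (two clusters with SD 3, standing in for the GPS sample). With covariance_type="spherical", covariances_ holds one variance per component, so its square root can be compared with the expected noise level.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic stand-in for the GPS data: two clusters with SD 3
rng = np.random.default_rng(0)
X = rng.normal(scale=3.0, size=(200, 2))
X[100:] += (10, 5)

gm = GaussianMixture(n_components=2, covariance_type="spherical",
                     random_state=0).fit(X)

# one variance per component; take the square root to compare with the expected SD
fitted_sd = np.sqrt(gm.covariances_)
print(fitted_sd)
```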

If you want to preset the variance, your problem reduces to finding only the best centers for your components. You can do that with k-means, for example. If you don't know the number of components, you can sweep over all plausible values (say 1 to 20) and evaluate the decrease in fitting error. Or you can adapt your own EM function to find the centers and the number of components simultaneously.
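The sweep could look roughly like this. It is a plain NumPy version of Lloyd's k-means written for illustration (the kmeans helper and its parameters are not from any library), run on synthetic data with two clusters of SD 3. The comparison against 2 * sigma**2 per point is my own rule of thumb for when extra components start fitting noise.

```python
import numpy as np


def kmeans(X, k, n_restarts=5, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, inertia) of the best restart."""
    rng = np.random.default_rng(seed)
    best_centers, best_inertia = None, np.inf
    for _restart in range(n_restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            # squared distance of every point to every center, shape (n_obs, k)
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # move each center to the mean of its points; keep it if it has none
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        inertia = d2.min(axis=1).sum()
        if inertia < best_inertia:
            best_centers, best_inertia = centers, inertia
    return best_centers, best_inertia


rng = np.random.default_rng(1)
X = rng.normal(scale=3.0, size=(200, 2))
X[100:] += (10, 5)

ks = range(1, 6)
inertias = [kmeans(X, k)[1] for k in ks]
for k, inertia in zip(ks, inertias):
    # with SD 3 in 2D, pure noise leaves roughly 2 * 3**2 = 18 per point
    print(k, round(inertia / len(X), 1))
```

The error should drop sharply up to the true number of clusters and only slowly after that.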
