Fit mixture of Gaussians with fixed covariance in Python

I have some 2D data (GPS data) with clusters (stop locations) that I know resemble Gaussians with a characteristic standard deviation (proportional to the inherent noise of GPS samples). The figure below visualizes a sample that I expect has two such clusters. The image is 25 meters wide and 13 meters tall.

[Figure: scatter plot of the GPS sample, showing what appear to be two clusters]

The sklearn module has a class, sklearn.mixture.GaussianMixture, which lets you fit a mixture of Gaussians to data. It has a parameter, covariance_type, that lets you make different assumptions about the shape of the Gaussians. For example, the 'tied' option makes all components share a single covariance matrix (which is still estimated from the data).
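For reference, a minimal sketch of the 'tied' fit on synthetic data standing in for my GPS sample (the numbers here are made up):

import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic 2D points: two clusters with an SD of about 3 m each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 3, size=(100, 2)),
               rng.normal((10, 5), 3, size=(100, 2))])

# 'tied': all components share one covariance matrix, but that matrix
# is still estimated from the data rather than held fixed
gmm = GaussianMixture(n_components=2, covariance_type='tied').fit(X)
print(gmm.means_)
print(gmm.covariances_)  # a single shared 2x2 matrix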

However, it does not appear to be directly possible to hold the covariance matrices constant. From the sklearn source code it seems trivial to modify it to enable this, but it feels a bit excessive to make a pull request for such an update (and I don't want to accidentally introduce bugs into sklearn). Is there a better way to fit a mixture to data where the covariance matrix of each Gaussian is fixed?

I want to assume that the SD should remain constant at around 3 meters for each component, since that is roughly the noise level of my GPS samples.

I think the best option would be to "roll your own" GMM model by defining a new scikit-learn class that inherits from GaussianMixture and overrides the methods needed to get the behavior you want. This way you keep the implementation to yourself and don't have to change the scikit-learn code (or create a pull request).
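A minimal sketch of what such a subclass could look like, clamping the covariances after every M-step. Note that it relies on private scikit-learn internals (_m_step and _compute_precision_cholesky), whose names and locations may change between versions:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky

class FixedCovGaussianMixture(GaussianMixture):
    """GaussianMixture whose covariances are reset to a fixed matrix after
    every M-step (sketch; relies on private sklearn internals)."""
    def __init__(self, n_components=1, fixed_cov=None, max_iter=100, random_state=None):
        super().__init__(n_components=n_components, covariance_type='full',
                         max_iter=max_iter, random_state=random_state)
        self.fixed_cov = fixed_cov

    def _m_step(self, X, log_resp):
        # let the parent update the weights, means and covariances ...
        super()._m_step(X, log_resp)
        # ... then clamp every component's covariance to the fixed matrix
        self.covariances_ = np.repeat(
            np.asarray(self.fixed_cov)[np.newaxis], self.n_components, axis=0)
        self.precisions_cholesky_ = _compute_precision_cholesky(
            self.covariances_, self.covariance_type)

# usage sketch: an SD of 3 m per axis corresponds to a covariance of 9 * I
# model = FixedCovGaussianMixture(n_components=2, fixed_cov=9 * np.eye(2)).fit(X)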

Another option that might work is the Bayesian version of the GMM in scikit-learn, BayesianGaussianMixture. You might be able to set the prior for the covariance matrix so that the covariance stays (close to) fixed. It seems to use a Wishart distribution as the prior for the covariance. However, I'm not familiar enough with this distribution to help you out more.
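A rough, untested sketch of where those prior parameters live; keep in mind this only expresses a prior belief rather than a hard constraint, so inspect the fitted covariances_ to see how strongly it binds (the data and values here are illustrative):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# synthetic 2D points: two clusters with an SD of about 3 m each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 3, size=(100, 2)),
               rng.normal((10, 5), 3, size=(100, 2))])

bgm = BayesianGaussianMixture(
    n_components=2,
    covariance_type='full',
    covariance_prior=9 * np.eye(2),  # prior on the covariance (sigma^2 * I for sigma = 3 m)
    degrees_of_freedom_prior=2,      # Wishart degrees of freedom; must be > n_features - 1
    random_state=1,
).fit(X)

# the prior only pulls the covariances toward the target; verify the result
print(bgm.covariances_)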

It is simple enough to write your own implementation of the EM algorithm. It would also give you a good intuition of the process. In the implementation below I assume that the covariance is known, that the prior probabilities of the components are equal, and fit only the means.

The class would look like this (in Python 3):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

class FixedCovMixture:
    """ The model to estimate gaussian mixture with fixed covariance matrix. """
    def __init__(self, n_components, cov, max_iter=100, random_state=None, tol=1e-10):
        self.n_components = n_components
        self.cov = cov
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X):
        # initialize the process:
        np.random.seed(self.random_state)
        n_obs, n_features = X.shape
        self.mean_ = X[np.random.choice(n_obs, size=self.n_components)]
        # make EM loop until convergence
        i = 0
        for i in range(self.max_iter):
            new_centers = self.updated_centers(X)
            if np.sum(np.abs(new_centers-self.mean_)) < self.tol:
                break
            else:
                self.mean_ = new_centers
        self.n_iter_ = i

    def updated_centers(self, X):
        """ A single iteration """
        # E-step: estimate probability of each cluster given cluster centers
        cluster_posterior = self.predict_proba(X)
        # M-step: update cluster centers as weighted average of observations
        weights = (cluster_posterior.T / cluster_posterior.sum(axis=1)).T
        new_centers = np.dot(weights, X)
        return new_centers


    def predict_proba(self, X):
        likelihood = np.stack([multivariate_normal.pdf(X, mean=center, cov=self.cov) 
                               for center in self.mean_])
        cluster_posterior = (likelihood / likelihood.sum(axis=0))
        return cluster_posterior

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=0)

On data like yours, the model converges quickly:

np.random.seed(1)
X = np.random.normal(size=(100,2), scale=3)
X[50:] += (10, 5)

model = FixedCovMixture(2, cov=[[3,0],[0,3]], random_state=1)
model.fit(X)
print(model.n_iter_, 'iterations')
print(model.mean_)

plt.scatter(X[:,0], X[:,1], s=10, c=model.predict(X))
plt.scatter(model.mean_[:,0], model.mean_[:,1], s=100, c='k')
plt.axis('equal')
plt.show();

and the output:

11 iterations
[[9.92301067 4.62282807]
 [0.09413883 0.03527411]]

You can see that the estimated centers, (9.9, 4.6) and (0.09, 0.03), are close to the true centers, (10, 5) and (0, 0).

[Figure: the points colored by predicted component, with the two estimated centers marked in black]

First, you can use the 'spherical' covariance_type, which gives you a single variance value per component. This lets you sanity-check the fit: if the estimated variances differ too much from each other (or from your expected GPS noise), something went wrong. A sketch of that check follows.
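A minimal sketch of that check on synthetic data (the values are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic 2D points: two clusters with an SD of about 3 m each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 3, size=(100, 2)),
               rng.normal((10, 5), 3, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='spherical').fit(X)
# one variance per component; their square roots should be close to the
# expected GPS noise level (about 3 m) and close to each other
print(np.sqrt(gmm.covariances_))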

If you want to preset the variance, your problem reduces to finding only the best centers for your components. You can do that with k-means, for example (see the sketch below). If you don't know the number of components, you can sweep over all plausible values (say, 1 to 20) and evaluate the decrease in the fitting error. Or you can write your own EM routine to find the centers and the number of components simultaneously.
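A minimal sketch of that sweep with scikit-learn's KMeans (the range of cluster counts and the data are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# synthetic 2D points: two clusters with an SD of about 3 m each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 3, size=(100, 2)),
               rng.normal((10, 5), 3, size=(100, 2))])

# fit k-means for a range of component counts and watch how the
# within-cluster sum of squares (inertia_) decreases
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(k, km.inertia_)
# pick the k where the decrease levels off (the "elbow"); cluster_centers_
# then gives the centers to pair with the fixed covariance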
