
python Fitting weighted data with Gaussian mixture model (GMM) with minimum on covariance

I want to fit a Gaussian mixture model to a set of weighted data points using python.

I tried sklearn.mixture.GMM(), which works fine except for the fact that it weights all data points equally. Does anyone know a way to assign weights to the data points in this method? I tried duplicating data points several times to "increase their weight", but this seems inefficient for large datasets.
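For reference, here is a minimal sketch of that duplication workaround (my own illustration, not from the original post), written against the current sklearn.mixture.GaussianMixture, the successor of the deprecated sklearn.mixture.GMM. Integer weights are emulated by repeating each point, which is exactly why this does not scale to large datasets:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                  # placeholder data
counts = rng.integers(1, 5, size=len(X))       # integer "weights", one per point

X_repeated = np.repeat(X, counts, axis=0)      # duplicate each point count times
gmm = GaussianMixture(n_components=3).fit(X_repeated)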

I also thought about implementing the EM algorithm myself, but this seems to be much slower than, e.g., the GMM method above and would greatly increase the computation time for large datasets.
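For reference, the weighted case only changes the EM updates slightly: each point's responsibilities are multiplied by its weight before the M step, and adding a small constant to every covariance serves as the covariance minimum mentioned in the title. The sketch below is purely illustrative (the function name, initialization, and the use of scipy.stats.multivariate_normal are my own assumptions), not an optimized implementation:

import numpy as np
from scipy.stats import multivariate_normal

def weighted_gmm_em(X, w, n_components, n_iter=100, cov_min=1e-3):
    """Fit a GMM to points X of shape (n, d) with per-point weights w of shape (n,)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    means = X[rng.choice(n, n_components, replace=False)]
    covs = np.array([np.cov(X.T) + cov_min * np.eye(d)] * n_components)
    pis = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, C_k)
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, means[k], covs[k])
                                for k in range(n_components)])
        r = dens / dens.sum(axis=1, keepdims=True)
        rw = r * w[:, None]                      # fold the sample weights in here
        # M step: standard updates, but with weighted responsibilities
        Nk = rw.sum(axis=0)
        pis = Nk / Nk.sum()
        means = (rw.T @ X) / Nk[:, None]
        for k in range(n_components):
            diff = X - means[k]
            covs[k] = (rw[:, k, None] * diff).T @ diff / Nk[k] + cov_min * np.eye(d)
    return pis, means, covs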

I just discovered the OpenCV implementation of the EM algorithm, cv2.EM(). This again works fine, but it has the same problem as sklearn.mixture.GMM, and additionally there seems to be no way to change the minimum value allowed for the covariance. Or is there a way to change the covariance minimum to, e.g., 0.001? I hoped that it would be possible to use the probs parameter to assign weights to the data, but this seems to be just an output parameter and has no influence on the fitting process, doesn't it? Using probs0 and starting the algorithm with the M step via trainM didn't help either. For probs0 I used a (number of data points) x (number of GMM components) matrix whose columns are identical, with the weighting parameter for each data point written to the corresponding row. This didn't solve the problem either; it just resulted in a mixture model where all means were 0.
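For clarity, the probs0 matrix described above can be constructed like this (illustrative only). As far as I can tell, cv2.EM interprets probs0 as initial per-sample, per-component membership probabilities rather than per-point weights, so making every column identical just initializes all components identically, which would explain the collapsed fit:

import numpy as np

point_weights = np.ones(200)   # hypothetical per-point weights
n_components = 6
# one row per data point, with that point's weight repeated across all components,
# so every column of probs0 is identical
probs0 = np.repeat(point_weights[:, None], n_components, axis=1)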

Does anyone have an idea how to adapt the methods above, or does anyone know another method, so that a GMM can be fitted to weighted data?

Thanks, Jane

If you're still looking for a solution, pomegranate now supports training GMMs on weighted data. All you need to do is pass in a vector of weights at training time and it'll handle it for you. Here is a short tutorial on GMMs in pomegranate!

The parent GitHub repository is here:

https://github.com/jmschrei/pomegranate

The specific tutorial is here:

https://github.com/jmschrei/pomegranate/blob/master/tutorials/B_Model_Tutorial_2_General_Mixture_Models.ipynb

Taking Jacob's suggestion, I coded up a pomegranate implementation example:

import pomegranate
import numpy
import sklearn
import sklearn.datasets 

#-------------------------------------------------------------------------------
#Get data from somewhere (moons data is nice for examples)
Xmoon, ymoon = sklearn.datasets.make_moons(200, shuffle = False, noise=.05, random_state=0)
Moon1 = Xmoon[:100] 
Moon2 = Xmoon[100:] 
MoonsDataSet = Xmoon

#Weight the data from moon2 much higher than moon1:
MoonWeights = numpy.array([numpy.ones(100), numpy.ones(100)*10]).flatten()

#Make the GMM model using pomegranate
model = pomegranate.gmm.GeneralMixtureModel.from_samples(
    pomegranate.MultivariateGaussianDistribution,   #Either single function, or list of functions
    n_components=6,     #Required if single function passed as first arg
    X=MoonsDataSet,     #data format: each row is a point-coordinate, each column is a dimension
    )

#Force the model to train again, using additional fitting parameters
model.fit(
    X=MoonsDataSet,         #data format: each row is a coordinate, each column is a dimension
    weights = MoonWeights,  #List of weights. One for each point-coordinate
    stop_threshold = .001,  #Lower this value to get a better fit, but fitting will take longer.
                            #   (sklearn prefers tighter/slower fits than pomegranate by default)
    )

#Wrap the model object into a probability density python function 
#   f(x_vector)
def GaussianMixtureModelFunction(Point):
    return model.probability(numpy.atleast_2d( numpy.array(Point) ))

#Plug in a single point to the mixture model and get back a value:
ExampleProbability = GaussianMixtureModelFunction( numpy.array([ 0,0 ]) )
print ('ExampleProbability', ExampleProbability)
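
As a quick sanity check (my own addition, not part of the original answer), one can compare the average density the fitted model assigns to each moon; with Moon2 weighted 10x higher, its points should typically receive higher values:

#Compare the average model density on each moon (illustrative check)
print ('Mean density on Moon1:', model.probability(Moon1).mean())
print ('Mean density on Moon2:', model.probability(Moon2).mean())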
