
How to properly remove redundant components for Scikit-Learn's DPGMM?

I am using scikit-learn to implement the Dirichlet Process Gaussian Mixture Model:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html

That is, it is sklearn.mixture.BayesianGaussianMixture() with the default weight_concentration_prior_type = 'dirichlet_process'. As opposed to k-means, where users set the number of clusters "k" a priori, DPGMM is an infinite mixture model with a Dirichlet Process prior on the number of clusters.

My DPGMM model consistently outputs the exact number of clusters as n_components. As discussed here, the correct way to deal with this is to "reduce redundant components" with predict(X):

Scikit-Learn's DPGMM fitting: number of components?

However, the linked example does not actually remove redundant components and show the "correct" number of clusters in the data. Rather, it simply plots the correct number of clusters.

http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

How do users actually remove the redundant components and output an array of labels that excludes them? Is this the "official"/only way to remove redundant clusters?

Here is my code:

>>> import pandas as pd 
>>> import numpy as np 
>>> import random
>>> from sklearn import mixture  
>>> X = pd.read_csv(....)   # my matrix
>>> X.shape
(20000, 48) 
>>> dpgmm3 = mixture.BayesianGaussianMixture(n_components = 20, weight_concentration_prior_type='dirichlet_process', max_iter = 1000, verbose = 2) 
>>> dpgmm3.fit(X) # Fitting the DPGMM model
>>> labels = dpgmm3.predict(X) # Generating labels after model is fitted
>>> max(labels)
>>> np.unique(labels) # Number of labels == n_components specified above
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

#Trying with a different n_components

>>> dpgmm3_1 = mixture.BayesianGaussianMixture( weight_concentration_prior_type='dirichlet_process', max_iter = 1000) #not specifying n_components
>>> dpgmm3_1.fit(X)
>>> labels_1 = dpgmm3_1.predict(X)  
>>> labels_1
array([0, 0, 0, ..., 0, 0, 0]) #All were classified under the same label

#Trying with n_components = 7

>>> dpgmm3_2 = mixture.BayesianGaussianMixture(n_components = 7, weight_concentration_prior_type='dirichlet_process', max_iter = 1000)
>>> dpgmm3_2.fit(X)

>>> labels_2 = dpgmm3_2.predict(X)
>>> np.unique(labels_2)
array([0, 1, 2, 3, 4, 5, 6]) #number of labels == n_components

There is no automated method to do so yet, but you can look at the estimated weights_ attribute and prune components that have a small weight (e.g. below 0.01).
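A minimal sketch of that pruning step. It uses synthetic make_blobs data in place of the question's CSV matrix, and the 0.01 threshold and variable names are assumptions, not an official API:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic 2-D data with 3 true clusters stands in for the question's matrix.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type='dirichlet_process',
    max_iter=1000,
    random_state=0,
).fit(X)

# Mark components whose posterior weight exceeds the (assumed) threshold.
active = dpgmm.weights_ > 0.01

# Reassign each sample to its most probable *active* component:
# zero out the probabilities of pruned components, then take the argmax.
proba = dpgmm.predict_proba(X)
proba[:, ~active] = 0.0
pruned_labels = proba.argmax(axis=1)

print("active components:", active.sum())
print("labels used after pruning:", np.unique(pruned_labels))
```

After this, np.unique(pruned_labels) only contains indices of active components, which is one way to get the label array "without" the redundant clusters the question asks about.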

Edit: to count the number of components effectively used by the model, you can do:

model = BayesianGaussianMixture(n_components=30).fit(X)
print("active components: %d" % np.sum(model.weights_ > 0.01))

This should print a number of active components lower than the provided upper bound (30 in this example).

Edit 2: the n_components parameter specifies the maximum number of components the model can use. The effective number of components actually used by the model can be retrieved by introspecting the weights_ attribute at the end of the fit. It will mostly depend on the structure of the data and on the value of weight_concentration_prior (especially if the number of samples is small).
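A short experiment sketching that dependence, again on assumed synthetic data: sweeping weight_concentration_prior and counting active components. The prior values and the 0.01 threshold are arbitrary illustrations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data with 3 true clusters (an assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

counts = []
for prior in (0.001, 1.0, 1000.0):
    model = BayesianGaussianMixture(
        n_components=20,
        weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=prior,
        max_iter=500,
        random_state=0,
    ).fit(X)
    # Count components with non-negligible posterior weight.
    n_active = int(np.sum(model.weights_ > 0.01))
    counts.append(n_active)
    print("prior=%g -> active components: %d" % (prior, n_active))
```

Smaller concentration priors tend to push more weight onto fewer components, so the active count typically stays closer to the true cluster count; large priors spread weight over more components.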

Check out the repulsive Gaussian mixtures described in [1]. They try to fit a mixture whose Gaussians overlap less and are therefore typically less redundant.

I didn't find source code for it (yet).

[1] https://papers.nips.cc/paper/4589-repulsive-mixtures.pdf
