How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)
My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data as it converges to a certain number of clusters.
This R implementation (https://github.com/jacobian1980/ecostates) decides on the number of clusters in this way. Although the R implementation uses a Gibbs sampler, I'm not sure whether that affects this.
What confuses me is the `n_components` parameter:

`n_components : int, default 1 : Number of mixture components.`

If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?
Ultimately, I'm trying to get:

(1) the cluster assignment for each sample;
(2) the probability vectors for each cluster; and
(3) the likelihood/log-likelihood for each sample.
It looks like (1) is the `predict` method and (3) is the `score` method. However, the output of (1) is completely dependent on the `n_components` hyperparameter.
My apologies if this is a naive question; I'm very new to Bayesian programming and noticed there was a Dirichlet Process implementation in Scikit-learn that I wanted to try out.
Here are the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
Here's my naive usage:

```python
import pandas as pd
from sklearn.mixture import DPGMM

X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
mod_dpgmm = DPGMM(n_components=3)
mod_dpgmm.fit(X)
```
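For reference, the three quantities above map onto the standard scikit-learn mixture API. This is a sketch on synthetic data (I'm substituting `GaussianMixture` here since `DPGMM` has been removed from recent scikit-learn releases; both classes expose the same `predict`/`predict_proba`/`score_samples` interface):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for data.tsv: two well-separated 2-D blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

model = GaussianMixture(n_components=3, random_state=0).fit(X)

labels = model.predict(X)            # (1) cluster assignment per sample
posteriors = model.predict_proba(X)  # (2) per-sample probability vector over clusters
log_lik = model.score_samples(X)     # (3) log-likelihood of each sample
```

Each row of `posteriors` sums to 1, and `model.score(X)` is just the mean of `log_lik`.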
As mentioned by @maxymoo in the comments, `n_components` is a truncation parameter.
In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation in sklearn's DP-GMM, a new data point joins an existing cluster `k` with probability `|k| / (n - 1 + alpha)` and starts a new cluster with probability `alpha / (n - 1 + alpha)`. Here `alpha` is the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.
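To make the role of `alpha` concrete, here is a minimal simulation of the Chinese Restaurant Process (a hypothetical illustration, not sklearn's actual inference): each point either joins an existing cluster in proportion to its size or opens a new one in proportion to `alpha`.

```python
import numpy as np

def simulate_crp(n_points, alpha, rng):
    """Seat n_points customers by the CRP; return the list of cluster sizes."""
    sizes = []  # sizes[k] = number of points currently in cluster k
    for n in range(n_points):
        # P(join cluster k) = |k| / (n + alpha);  P(new cluster) = alpha / (n + alpha)
        probs = np.array(sizes + [alpha], dtype=float) / (n + alpha)
        choice = rng.choice(len(probs), p=probs)
        if choice == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[choice] += 1
    return sizes

rng = np.random.default_rng(0)
small_alpha = simulate_crp(1000, alpha=0.5, rng=rng)
large_alpha = simulate_crp(1000, alpha=10.0, rng=rng)
print(len(small_alpha), len(large_alpha))
```

With a small `alpha` the rich-get-richer dynamics concentrate the data in a few clusters; with a large `alpha` many more clusters emerge, which is exactly the sense in which `alpha` acts as a prior on the number of clusters.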
Unlike R's implementation, which uses Gibbs sampling, sklearn's DP-GMM implementation uses variational inference. This may account for the difference in results.
A gentle Dirichlet Process tutorial can be found here.
The class DPGMM is now deprecated, as the warning shows:

DeprecationWarning: Class DPGMM is deprecated; The DPGMM class is not working correctly and it's better to use sklearn.mixture.BayesianGaussianMixture class with parameter weight_concentration_prior_type='dirichlet_process' instead. DPGMM is deprecated in 0.18 and will be removed in 0.20.