
How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)

My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data, as the model converges to a certain number of clusters.

This R implementation (https://github.com/jacobian1980/ecostates) decides on the number of clusters in this way. Although the R implementation uses a Gibbs sampler, I'm not sure whether that affects this.

What confuses me is the n_components parameter. The docs say: "n_components : int, default 1. Number of mixture components." If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?


Ultimately, I'm trying to get:

(1) the cluster assignment for each sample;

(2) the probability vectors for each cluster; and

(3) the likelihood/log-likelihood for each sample.

It looks like (1) is the predict method and (3) is the score method. However, the output of (1) is completely dependent on the n_components hyperparameter.
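For a fitted mixture model, the three outputs map onto the estimator API roughly as follows. This is a sketch on synthetic two-cluster data, using the modern BayesianGaussianMixture class (since DPGMM is deprecated in recent sklearn versions); the data and settings here are illustrative assumptions, not from the original question:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data: two well-separated Gaussian blobs in 2D
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

model = BayesianGaussianMixture(
    n_components=10,  # truncation level, not a fixed cluster count
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = model.predict(X)         # (1) cluster assignment for each sample
weights = model.weights_          # (2) mixture weight of each component
log_lik = model.score_samples(X)  # (3) per-sample log-likelihood
```

`score` returns the average of `score_samples` over all samples, so `score_samples` is the per-sample quantity asked for in (3).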

My apologies if this is a naive question; I'm very new to Bayesian programming, and I noticed there was a Dirichlet Process implementation in Scikit-learn that I wanted to try out.


Here are the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

Here's my naive usage:

import pandas as pd
from sklearn.mixture import DPGMM

# Load the data (tab-separated, first column as the index)
X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)

mod_dpgmm = DPGMM(n_components=3)
mod_dpgmm.fit(X)

As mentioned by @maxymoo in the comments, n_components is a truncation parameter.
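The truncation can be seen directly: set n_components well above the expected number of clusters, and the Dirichlet Process prior drives most mixture weights toward zero. A sketch on synthetic three-cluster data (the data, truncation level, and weight threshold are all illustrative assumptions; BayesianGaussianMixture is used since DPGMM is deprecated):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(1)
# Three well-separated Gaussian blobs in 2D
X = np.vstack([rng.normal(loc, 0.5, (60, 2)) for loc in (0, 5, 10)])

# Truncate the "infinite" mixture at 15 components; the DP prior
# should leave only a few components with non-negligible weight.
model = BayesianGaussianMixture(
    n_components=15,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=1,
).fit(X)

# Effective number of clusters: components carrying real weight
active = int((model.weights_ > 0.01).sum())
```

The fitted model still has 15 components, but only the active ones matter for prediction; the rest are pruned in effect, not in shape.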

In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation used in sklearn's DP-GMM, a new data point joins an existing cluster k with probability |k| / (n - 1 + alpha), where |k| is the size of cluster k, and starts a new cluster with probability alpha / (n - 1 + alpha). The alpha parameter can be interpreted as the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.
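Those two probabilities can be simulated directly. A minimal Chinese Restaurant Process sketch in pure Python (the alpha values and customer count are hypothetical) shows that a larger alpha yields more clusters:

```python
import random

def crp(n, alpha, seed=0):
    """Simulate a Chinese Restaurant Process; return the cluster sizes."""
    rng = random.Random(seed)
    clusters = []  # sizes of the existing clusters
    for i in range(n):
        # i customers are already seated, so the total mass is i + alpha
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, size in enumerate(clusters):
            acc += size
            if r < acc:  # join existing cluster k w.p. size / (i + alpha)
                clusters[k] += 1
                break
        else:
            clusters.append(1)  # start a new cluster w.p. alpha / (i + alpha)
    return clusters

few = crp(500, alpha=0.5)    # small alpha: few clusters
many = crp(500, alpha=10.0)  # large alpha: many clusters
```

The expected number of clusters grows roughly like alpha * log(n), which is why the concentration parameter, not n_components, governs how many clusters the DP actually uses.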

Unlike the R implementation, which uses Gibbs sampling, sklearn's DP-GMM implementation uses variational inference. This may account for the difference in results.

A gentle Dirichlet Process tutorial can be found here.

The DPGMM class is now deprecated, as the warning shows: "DeprecationWarning: Class DPGMM is deprecated; The DPGMM class is not working correctly and it's better to use sklearn.mixture.BayesianGaussianMixture class with parameter weight_concentration_prior_type='dirichlet_process' instead. DPGMM is deprecated in 0.18 and will be removed in 0.20."
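For reference, the naive usage above translates to the replacement class roughly like this (a sketch only; random data stands in for the original data.tsv, whose shape is unknown):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Stand-in for the original TSV data: 200 samples, 3 features (assumed)
X = np.random.RandomState(0).randn(200, 3)

mod = BayesianGaussianMixture(
    n_components=3,  # now explicitly a truncation level
    weight_concentration_prior_type="dirichlet_process",
).fit(X)

labels = mod.predict(X)  # cluster assignment per sample, as before
```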

