How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)
My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data as it converges to a certain number of clusters.
This R implementation (https://github.com/jacobian1980/ecostates) decides on the number of clusters in this way. Although the R implementation uses a Gibbs sampler, I'm not sure whether that affects this.
What confuses me is the `n_components` parameter:

`n_components : int, default 1 : Number of mixture components.`

If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?
Ultimately, I'm trying to get:

(1) the cluster assignment for each sample;
(2) the probability vectors for each cluster; and
(3) the likelihood/log-likelihood for each sample.
It looks like (1) is the `predict` method and (3) is the `score` method. However, the output of (1) is completely dependent on the `n_components` hyperparameter.
My apologies if this is a naive question; I'm very new to Bayesian programming and noticed there was a Dirichlet Process implementation in Scikit-learn that I wanted to try out.
Here are the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
Here's my naive usage:

```python
import pandas as pd
from sklearn.mixture import DPGMM

X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
mod_dpgmm = DPGMM(n_components=3)
mod_dpgmm.fit(X)
```
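For reference, the three quantities above map onto the standard scikit-learn mixture API. This is a sketch on synthetic data (I'm substituting `GaussianMixture` here since `DPGMM` has been removed from recent scikit-learn releases; both classes expose the same `predict`/`predict_proba`/`score_samples` interface):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for data.tsv: two well-separated 2-D blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

model = GaussianMixture(n_components=3, random_state=0).fit(X)

labels = model.predict(X)            # (1) cluster assignment per sample
posteriors = model.predict_proba(X)  # (2) per-sample probability vector over clusters
log_lik = model.score_samples(X)     # (3) log-likelihood of each sample
```

Each row of `posteriors` sums to 1, and `model.score(X)` is just the mean of `log_lik`.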
As mentioned by @maxymoo in the comments, `n_components` is a truncation parameter.
In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation in sklearn's DP-GMM, a new data point joins an existing cluster `k` with probability `|k| / (n - 1 + alpha)` and starts a new cluster with probability `alpha / (n - 1 + alpha)`. Here `alpha` is the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.
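To make the role of `alpha` concrete, here is a minimal simulation of the Chinese Restaurant Process (a hypothetical illustration, not sklearn's actual inference): each point either joins an existing cluster in proportion to its size or opens a new one in proportion to `alpha`.

```python
import numpy as np

def simulate_crp(n_points, alpha, rng):
    """Seat n_points customers by the CRP; return the list of cluster sizes."""
    sizes = []  # sizes[k] = number of points currently in cluster k
    for n in range(n_points):
        # P(join cluster k) = |k| / (n + alpha);  P(new cluster) = alpha / (n + alpha)
        probs = np.array(sizes + [alpha], dtype=float) / (n + alpha)
        choice = rng.choice(len(probs), p=probs)
        if choice == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[choice] += 1
    return sizes

rng = np.random.default_rng(0)
small_alpha = simulate_crp(1000, alpha=0.5, rng=rng)
large_alpha = simulate_crp(1000, alpha=10.0, rng=rng)
print(len(small_alpha), len(large_alpha))
```

With a small `alpha` the rich-get-richer dynamics concentrate the data in a few clusters; with a large `alpha` many more clusters emerge, which is exactly the sense in which `alpha` acts as a prior on the number of clusters.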
Unlike R's implementation, which uses Gibbs sampling, sklearn's DP-GMM implementation uses variational inference. This may account for the difference in results.
A gentle Dirichlet Process tutorial can be found here.
The class DPGMM is now deprecated, as the warning shows:

DeprecationWarning: Class DPGMM is deprecated; The DPGMM class is not working correctly and it's better to use sklearn.mixture.BayesianGaussianMixture class with parameter weight_concentration_prior_type='dirichlet_process' instead. DPGMM is deprecated in 0.18 and will be removed in 0.20.