
How to extract unsupervised clusters from a Dirichlet Process in PyMC3?

I just finished the book Bayesian Analysis in Python by Osvaldo Martin (a great book for understanding Bayesian concepts and some fancy NumPy indexing).

I really want to extend my understanding to Bayesian mixture models for unsupervised clustering of samples. All of my Google searches have led me to Austin Rochford's tutorial, which is really informative. I understand what is happening, but I am unclear on how this can be adapted to clustering (especially using multiple attributes for the cluster assignments, but that is a different topic).

I understand how to assign the priors for the Dirichlet distribution, but I can't figure out how to get the clusters in PyMC3. It looks like the majority of the mus converge to the centroids (i.e. the means of the distributions I sampled from), but they are still separate components. I thought about making a cutoff for the weights (w in the model), but that doesn't seem to work the way I imagined, since multiple components have slightly different mean parameters mus that are converging.

How can I extract the clusters (centroids) from this PyMC3 model? I gave it a maximum of 15 components that I want to converge to 3. The mus seem to be at the right location, but the weights are messed up because they are being distributed between the other clusters, so I can't use a weight threshold (unless I merge them, but I don't think that's the way it is normally done).

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import seaborn as sns
import pandas as pd
import theano.tensor as tt
%matplotlib inline

# Clip at 15 components
K = 15

# Create mixture population
centroids = [0, 10, 50]
weights = [(2/5),(2/5),(1/5)]

mix_3 = np.concatenate([np.random.normal(loc=centroids[0], size=int(150*weights[0])), # 60 samples
                        np.random.normal(loc=centroids[1], size=int(150*weights[1])), # 60 samples
                        np.random.normal(loc=centroids[2], size=int(150*weights[2]))])# 30 samples
n = mix_3.size


# Create and fit model
with pm.Model() as Mod_dir:
    alpha = pm.Gamma('alpha', 1., 1.)

    beta = pm.Beta('beta', 1., alpha, shape=K)

    w = pm.Deterministic('w', beta * tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]]))

    component = pm.Categorical('component', w, shape=n)

    tau = pm.Gamma("tau", 1.0, 1.0, shape=K)

    mu = pm.Normal('mu', 0, tau=tau, shape=K)

    obs = pm.Normal('obs',
                    mu[component], 
                    tau=tau[component],
                    observed=mix_3)

    step1 = pm.Metropolis(vars=[alpha, beta, w, tau, mu, obs])
#     step2 = pm.CategoricalGibbsMetropolis(vars=[component])
    step2 = pm.ElemwiseCategorical([component], np.arange(K)) # Much, much faster than the above

    tr = pm.sample(10000, [step1, step2], njobs=multiprocessing.cpu_count())

# Discard the first 1000 draws as burn-in, then thin by keeping every 5th draw
pm.traceplot(tr[1000::5])


Similar questions below:

https://stats.stackexchange.com/questions/120209/pymc3-dirichlet-distribution (for regression and not clustering)

https://stats.stackexchange.com/questions/108251/image-clustering-and-dirichlet-process (theory on the DP)

https://stats.stackexchange.com/questions/116311/draw-a-multinomial-distribution-from-a-dirichlet-distribution (explains DP)

"Dirichlet process in PyMC 3" directs me to Austin Rochford's tutorial above

Using a couple of new-ish additions to pymc3 will help make this clear. I think I updated the Dirichlet Process example after they were added, but it seems to have been reverted to the old version during a documentation cleanup; I will fix that soon.

One of the difficulties is that the data you have generated are much more dispersed than the priors on the component means can accommodate; if you standardize your data, the samples should mix much more quickly.

The second is that pymc3 now supports mixture distributions where the indicator variable component has been marginalized out. These marginal mixture distributions will help accelerate mixing and allow you to use NUTS (initialized with ADVI).
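To make "marginalized out" concrete: instead of sampling a discrete label for every observation, each point's likelihood sums over all components, log p(x_i) = log sum_k w_k Normal(x_i | mu_k, 1/tau_k). The following is a minimal NumPy sketch of that quantity, not PyMC3's implementation; the function names and the example values are assumptions made for illustration.

```python
import numpy as np


def normal_logpdf(x, mu, tau):
    """Log density of Normal(mu, 1/tau), parameterized by precision tau."""
    return 0.5 * (np.log(tau) - np.log(2.0 * np.pi)) - 0.5 * tau * (x - mu) ** 2


def marginal_mixture_logp(x, w, mu, tau):
    """Sum over observations of log sum_k w_k * Normal(x_i | mu_k, 1/tau_k)."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (n, 1)
    log_terms = np.log(w) + normal_logpdf(x, mu, tau)  # shape (n, K)
    # Log-sum-exp over components, written out by hand for stability.
    m = log_terms.max(axis=1, keepdims=True)
    per_point = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))
    return per_point.sum()


# Illustrative values only (not taken from the fitted model).
w = np.array([0.4, 0.4, 0.2])
mu = np.array([-0.7, -0.2, 2.1])
tau = np.array([50.0, 50.0, 50.0])
x = np.array([-0.72, -0.15, 2.05])

total_logp = marginal_mixture_logp(x, w, mu, tau)
```

Because this log-likelihood is a smooth function of w, mu and tau with no discrete variable left, a gradient-based sampler such as NUTS can be applied to it.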

Finally, with these truncated versions of infinite models, when encountering computational problems, it is often useful to increase the number of potential components. I have found that K = 30 works better for this model than K = 15.
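One way to see what the truncation does is to look at the stick-breaking weights directly. The sketch below mirrors the expression used for w in the model, but in plain NumPy; the seed and the fixed value of alpha are arbitrary assumptions for illustration (in the model, alpha has a Gamma prior).

```python
import numpy as np


def stick_breaking(beta):
    """Map stick-breaking fractions beta_1..beta_K to weights w_1..w_K."""
    beta = np.asarray(beta, dtype=float)
    # Fraction of the stick still unbroken before each break.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    return beta * remaining


rng = np.random.RandomState(0)  # arbitrary seed for the illustration
alpha = 2.0                     # assumed concentration, for illustration only

for K in (15, 30):
    beta = rng.beta(1.0, alpha, size=K)
    w = stick_breaking(beta)
    leftover = 1.0 - w.sum()    # equals prod(1 - beta): mass lost to truncation
    print(K, leftover)
```

Since each beta_k ~ Beta(1, alpha) has mean 1 / (1 + alpha), the expected leftover mass is (alpha / (1 + alpha)) ** K, which shrinks geometrically as K grows. A larger K therefore makes the truncated model a closer approximation to the infinite one.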

The following code implements these changes and shows how the "active" component means can be extracted.

from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import seaborn as sns
from theano import tensor as T

blue = sns.color_palette()[0]

np.random.seed(462233) # from random.org

N = 150

CENTROIDS = np.array([0, 10, 50])
WEIGHTS = np.array([0.4, 0.4, 0.2])

x = np.random.normal(CENTROIDS[np.random.choice(3, size=N, p=WEIGHTS)], size=N)
x_std = (x - x.mean()) / x.std()

fig, ax = plt.subplots(figsize=(8, 6))

ax.hist(x_std, bins=30);

(Figure: Standardized data)

K = 30

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta)[:-1]]))

    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda', 0, 5, shape=K)
    mu = pm.Normal('mu', 0, tau=lambda_ * tau, shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=x_std)

with model:
    trace = pm.sample(2000, n_init=100000)

fig, ax = plt.subplots(figsize=(8, 6))

ax.bar(np.arange(K) - 0.4, trace['w'].mean(axis=0));

We see that three components appear to be used, and that their weights are reasonably close to the true values.

(Figure: Mixture weights)
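Reading the active components off the bar chart can also be done programmatically by thresholding the posterior mean weights. This is only a sketch: the helper name `active_components` and the 0.01 threshold are assumptions, and the array `w_samples` is a synthetic stand-in with the same (draws, K) layout as trace['w'].

```python
import numpy as np


def active_components(w_samples, threshold=0.01):
    """Return indices whose posterior mean weight exceeds threshold, plus the means."""
    mean_w = np.asarray(w_samples).mean(axis=0)
    return np.flatnonzero(mean_w > threshold), mean_w


# Synthetic stand-in for trace['w'] with shape (draws, K).
rng = np.random.RandomState(1)
K = 30
base = np.full(K, 1e-4)
base[:3] = [0.4, 0.4, 0.2]
noise = rng.normal(scale=0.002, size=(200, K))
w_samples = np.abs(base + noise)
w_samples /= w_samples.sum(axis=1, keepdims=True)  # each draw sums to 1

idx, mean_w = active_components(w_samples)
print(idx, mean_w[idx].round(2))
```

With the fitted model, the call would be `active_components(trace['w'])`. Averaging per index assumes the component labels stay stable across draws and chains, which is worth checking in the trace plot before trusting the result.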

Finally, we see that the posterior expected means of these three components match the true (standardized) means fairly well.

trace['mu'].mean(axis=0)[:3]

array([-0.73763891, -0.17284594, 2.10423978])

(CENTROIDS - x.mean()) / x.std()

array([-0.73017789, -0.16765707, 2.0824262 ])
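To go from these standardized means to cluster centroids in the original units, and to a cluster label for each observation, one possible approach is to undo the standardization and assign each point to the component with the largest responsibility under plug-in parameter estimates. The sketch below is an assumption-laden illustration rather than part of the model above: the helper names are made up, the seed is arbitrary, and the true standardized parameters stand in for posterior means.

```python
import numpy as np


def unstandardize(mu_std, x_mean, x_std):
    """Map component means from standardized units back to the data scale."""
    return np.asarray(mu_std) * x_std + x_mean


def assign_clusters(x, w, mu, tau):
    """Hard-assign each point to the component with the largest responsibility."""
    x = np.asarray(x, dtype=float)[:, None]
    # Unnormalized log responsibilities: log w_k + log Normal(x_i | mu_k, 1/tau_k),
    # dropping the constant term that does not affect the argmax.
    log_r = np.log(w) + 0.5 * np.log(tau) - 0.5 * tau * (x - mu) ** 2
    return log_r.argmax(axis=1)


# Simulate data in the same way as above (seed chosen arbitrarily here).
rng = np.random.RandomState(0)
centroids = np.array([0.0, 10.0, 50.0])
weights = np.array([0.4, 0.4, 0.2])
true_labels = rng.choice(3, size=150, p=weights)
x = rng.normal(centroids[true_labels])
x_std = (x - x.mean()) / x.std()

# Plug-in estimates; the true standardized values stand in for posterior
# means such as trace['mu'].mean(axis=0)[:3].
mu_hat = (centroids - x.mean()) / x.std()
tau_hat = np.full(3, x.std() ** 2)  # unit variance on the original scale
w_hat = weights

labels = assign_clusters(x_std, w_hat, mu_hat, tau_hat)
centroids_hat = unstandardize(mu_hat, x.mean(), x.std())
```

With the fitted model, the plug-in values would come from the trace instead, for example the posterior means of w, mu and the precision lambda * tau restricted to the active components. Using posterior means ignores parameter uncertainty, so this gives a point estimate of the clustering rather than a full posterior over assignments.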
