简体   繁体   中英

Posterior predictive check in PyMC3, discrete distributions

I want to implement a basic model of capture and recapture in PyMC3 (you capture 100 animals and mark them, then you liberate them and recapture 100 of them after they have mixed and annotate how many are marked). This is my code:

import numpy as np
import pymc3 as pm
import arviz as az

# Datos:
K = 100 #marked animals in first round
n = 100 #captured animals in second round
obs = 10 #observed marked in second round

with pm.Model() as my_model:
    N = pm.DiscreteUniform("N", lower=K, upper=10000)
    
    likelihood = pm.HyperGeometric('likelihood', N=N, k=K, n=n, observed=obs)

    trace = pm.sample(10000)
    
    print(pm.summary(trace))
    print(trace['N'])

    ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"])

    data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc) #create inference data

az.plot_ppc(data_ppc, figsize=(12, 6))

But I obtain the error in plot_ppc that 'var names: "[\'likelihood\'] are not present" in dataset' . Also the warning posterior predictive variable N's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented. posterior predictive variable N's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented.

What is happening and what can I do to obtain a posterior predictive check plot?

The root of all the problems is in this line ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"]) .

By using var_names=["N"] you are indicating PyMC to "sample" only the variable N which is actually a latent variable that was sampled while sampling the posterior in the pm.sample call. Doing this in the pm.sample_posterior_predictive call is indicating PyMC to not sample the observed variable ( likelihood in this case) and to just copy the samples for N to the posterior predictive too. You will see that data_ppc is an InferenceData object with multiple groups, N is already in the posterior group (like it is in the trace object).

By using 100 (aka samples=100 as a positional argument) you are indicating PyMC to draw posterior predictive samples only for the first 100 draws of the first chain. This is a bad idea, so ArviZ prints a warning when converting to InferenceData . You should generate one posterior predictive sample per posterior sample, only generating samples for a subset of the posterior if posterior predictive sampling were very slow.

My recommendation, which also applies as a general rule, is to trust PyMC defaults unless you have a reason not to or want things to give the same result with multiple versions. We update the defaults from time to time to try and keep them coherent with best practices which are updated and improve over the time. You should therefore do: ppc = pm.sample_posterior_predictive(trace) . PyMC will default to sampling the likelihood variable only and to generate one sample per posterior draw.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM