简体   繁体   English

PyMC3:分层橄榄球模型后部?

[英]PyMC3: Hierarchical rugby model posterior?

I've just started reading through the PyMC3 documentation (I'm much more comfortable with sklearn ) and came across the Rugby hierarchical model example :我刚刚开始通过阅读PyMC3文档(我更舒服sklearn )和跨来到橄榄球层次模型例子

# Imports and Rugby data setup -- model in next section

import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
import matplotlib.pyplot as plt
import seaborn as sns

games = [
    ['Wales', 'Italy', 23, 15],
    ['France', 'England', 26, 24],
    ['Ireland', 'Scotland', 28, 6],
    ['Ireland', 'Wales', 26, 3],
    ['Scotland', 'England', 0, 20],
    ['France', 'Italy', 30, 10],
    ['Wales', 'France', 27, 6],
    ['Italy', 'Scotland', 20, 21],
    ['England', 'Ireland', 13, 10],
    ['Ireland', 'Italy', 46, 7],
    ['Scotland', 'France', 17, 19],
    ['England', 'Wales', 29, 18],
    ['Italy', 'England', 11, 52],
    ['Wales', 'Scotland', 51, 3],
    ['France', 'Ireland', 20, 22],
]
columns = ['home_team', 'away_team', 'home_score', 'away_score']
df = pd.DataFrame(games, columns=columns)

teams = df.home_team.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index

df = pd.merge(df, teams, left_on='home_team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='away_team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_away'}).drop('team', 1)

observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values

home_team = df.i_home.values
away_team = df.i_away.values

num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)

g = df.groupby('i_away')
att_starting_points = np.log(g.away_score.mean())
g = df.groupby('i_home')
def_starting_points = -np.log(g.away_score.mean())

Here's the main PyMC3 model setup:这是主要的PyMC3模型设置:

with pm.Model() as model:
    # Global model parameters
    home = pm.Normal('home', 0, tau=.0001)
    tau_att = pm.Gamma('tau_att', .1, .1)
    tau_def = pm.Gamma('tau_def', .1, .1)
    intercept = pm.Normal('intercept', 0, tau=.0001)

    # Team-specific model parameters
    atts_star = pm.Normal('atts_star', mu=0, tau=tau_att, shape=num_teams)
    defs_star = pm.Normal('defs_star', mu=0, tau=tau_def, shape=num_teams)

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
    home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team])
    away_theta = tt.exp(intercept + atts[away_team] + defs[home_team])

    # Likelihood of observed data
    home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals)
    away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals)

    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(20000, step, init=start) 

I know how to plot the trace :我知道如何绘制trace

pm.traceplot(trace[5000:])

And generate posterior predictive samples :并生成后验预测样本

ppc = pm.sample_ppc(trace[5000:], samples=500, model=model)

What I'm unsure of: how do I ask questions of the model/posterior?我不确定的是:我如何向模特/后视镜提问?

For example, I'm assuming the distribution of scores for Wales vs Italy matchups would be:例如,我假设Wales vs Italy比赛的得分分布是:

# Wales vs Italy is the first matchup in our dataset
home_wales = ppc['home_points'][:, 0]
away_italy = ppc['away_points'][:, 0]

But what about matchups that aren't recorded in the original data?但是原始数据中没有记录的对局呢?

  • If Italy plays at home against France, what does their score distribution look like?如果意大利主场迎战法国,他们的得分分布是怎样的?
  • If Italy plays at home against France, how often do both teams score under 15?如果意大利在主场对阵法国,两支球队得分低于 15 的频率是多少?

Thanks for any help/insights.感谢您提供任何帮助/见解。

I'm fairly certain I was able to figure this out after reading through the PyMC3 Hierarchical Partial Pooling example .我相当确定在阅读PyMC3 Hierarchical Partial Pooling 示例后我能够弄清楚这一点。 Answering the questions in order:依次回答问题:

  1. Yes, that is what the distribution for Wales vs Italy matchups would be (since it's the first game in the observed data).是的,这就是Wales vs Italy比赛的分布情况(因为这是观察数据中的第一场比赛)。

  2. In order to predict Italy vs France (since the two teams did not play against each other in our original dataset), we would need prediction thetas .为了预测Italy vs France对阵Italy vs France (因为在我们的原始数据集中两支球队没有交手),我们需要预测thetas

Here's the code to the updated model:这是更新模型的代码:

# Setup the model similarly to the previous one...
with pm.Model() as model:
    # Global model parameters
    home = pm.Normal('home', 0, tau=.0001)
    tau_att = pm.Gamma('tau_att', .1, .1)
    tau_def = pm.Gamma('tau_def', .1, .1)
    intercept = pm.Normal('intercept', 0, tau=.0001)

    # Team-specific model parameters
    atts_star = pm.Normal('atts_star', mu=0, tau=tau_att, shape=num_teams)
    defs_star = pm.Normal('defs_star', mu=0, tau=tau_def, shape=num_teams)

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
    home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team])
    away_theta = tt.exp(intercept + atts[away_team] + defs[home_team])

    # Likelihood of observed data
    home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals)
    away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals)

# Now for predictions with no games played...
with model:
    # IDs from `teams` DataFrame
    italy, france = 4, 1
    # New `thetas` for Italy vs France predictions
    pred_home_theta = tt.exp(intercept + home + atts[italy] + defs[france])
    pred_away_theta = tt.exp(intercept + atts[france] + defs[italy])
    pred_home_points = pm.Poisson('pred_home_points', mu=pred_home_theta)
    pred_away_points = pm.Poisson('pred_away_points', mu=pred_away_theta)

# Sample the final model
with model:
    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(20000, step, init=start)

Once the trace is completed, we can plot the predictions: trace完成后,我们可以绘制预测:

# Use 5,000 as MCMC burn in
pred = pd.DataFrame({
    "italy": trace["pred_home_points"][5000:],
    "france": trace["pred_away_points"][5000:],
})
# Plot the distributions
sns.kdeplot(pred.italy, shade=True, label="Italy")
sns.kdeplot(pred.france, shade=True, label="France")
plt.show()

意大利对法国橄榄球分布

How often does Italy win at home?意大利在主场获胜的频率如何?

# 19% of the time
(pred.italy > pred.france).mean()

How often do both teams score under 15?两支球队得分低于 15 的频率如何?

# 0.7% of the time
1.0 * len(pred[(pred.italy <= 15) & (pred.france <= 15)]) / len(pred)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM