I am using PyMC to fit some data to a straight line. The data have outliers, so I adapted some code (third example at the link) written by Jake Vanderplas for his textbook. The method uses a vector variable qi
to encode whether each individual data point belongs to the foreground model (which we are fitting to the line) or the background model, which we don't care about.
class lin_fit_ol(object):
'''
fit a straight line to one independent variable
(`xi`, with zero errors) and one dependent variable
(`yi`, with possibly heteroscedastic errors `dyi`)
Outliers in `yi` are permitted
Intended to be a complement to a straight-line fit, for model
testing purposes
Modified from Vanderplas's code
(found at http://www.astroml.\
org/book_figures/chapter8/fig_outlier_rejection.html)
'''
def __init__(self, xi, yi, dyi, value):
self.xi, self.yi, self.dyi, self.value = xi, yi, dyi, value
@pymc.stochastic
def beta(value=np.array([0.5, 1.0])):
"""Slope and intercept parameters for a straight line.
The likelihood corresponds to the prior probability of the parameters."""
slope, intercept = value
prob_intercept = 1 + 0 * intercept
# uniform prior on theta = arctan(slope)
# d[arctan(x)]/dx = 1 / (1 + x^2)
prob_slope = np.log(1. / (1. + slope ** 2))
return prob_intercept + prob_slope
@pymc.deterministic
def model(xi=xi, beta=beta):
slope, intercept = beta
return slope * xi + intercept
# uniform prior on Pb, the fraction of bad points
Pb = pymc.Uniform('Pb', 0, 1.0, value=0.1)
# uniform prior on Yb, the centroid of the outlier distribution
Yb = pymc.Uniform('Yb', -10000, 10000, value=0)
# uniform prior on log(sigmab), the spread of the outlier distribution
log_sigmab = pymc.Uniform('log_sigmab', -10, 10, value=5)
# qi is bernoulli distributed
# Note: this syntax requires pymc version 2.2
qi = pymc.Bernoulli('qi', p=1 - Pb, value=np.ones(len(xi)))
@pymc.deterministic
def sigmab(log_sigmab=log_sigmab):
return np.exp(log_sigmab)
def outlier_likelihood(yi, mu, dyi, qi, Yb, sigmab):
"""likelihood for full outlier posterior"""
Vi = dyi ** 2
Vb = sigmab ** 2
root2pi = np.sqrt(2 * np.pi)
logL_in = -0.5 * np.sum(
qi * (np.log(2 * np.pi * Vi) + (yi - mu) ** 2 / Vi))
logL_out = -0.5 * np.sum(
(1 - qi) * (np.log(2 * np.pi * (Vi + Vb)) +
(yi - Yb) ** 2 / (Vi + Vb)))
return logL_out + logL_in
OutlierNormal = pymc.stochastic_from_dist(
'outliernormal', logp=outlier_likelihood, dtype=np.float,
mv=True)
y_outlier = OutlierNormal(
'y_outlier', mu=model, dyi=dyi, Yb=Yb, sigmab=sigmab, qi=qi,
observed=True, value=yi)
self.M = dict(y_outlier=y_outlier, beta=beta, model=model,
qi=qi, Pb=Pb, Yb=Yb, log_sigmab=log_sigmab,
sigmab=sigmab)
self.sample_invoked = False
def sample(self, iter, burn, calc_deviance=True):
self.S0 = pymc.MCMC(self.M)
self.S0.sample(iter=iter, burn=burn)
self.trace = self.S0.trace('beta')
self.btrace = self.trace[:, 0]
self.mtrace = self.trace[:, 1]
self.sample_invoked = True
def triangle(self):
assert self.sample_invoked == True, \
'Must sample first! Use sample(iter, burn)'
corner(self.trace[:], labels=['$m$', '$b$'])
def plot(self, xlab='$x$', ylab='$y$'):
# plot the data points
plt.errorbar(self.xi, self.yi, yerr=self.dyi, fmt='.k')
# do some shimmying to get quantile bounds
xa = np.linspace(self.xi.min(), self.xi.max(), 100)
A = np.vander(xa, 2)
# generate all possible lines
lines = np.dot(self.trace[:], A.T)
quantiles = np.percentile(lines, [16, 84], axis=0)
plt.fill_between(xa, quantiles[0], quantiles[1],
color="#8d44ad", alpha=0.5)
# plot circles around points identified as outliers
qi = self.S0.trace('qi')[:]
Pi = qi.astype(float).mean(0)
outlier_x = self.xi[Pi < 0.32]
outlier_y = self.yi[Pi < 0.32]
plt.scatter(outlier_x, outlier_y, lw=1, s=400, alpha=0.5,
facecolors='none', edgecolors='red')
plt.xlabel(xlab)
plt.ylabel(ylab)
def ICs(self):
self.MAP = pymc.MAP(self.M)
self.MAP.fit()
self.BIC = self.MAP.BIC
self.AIC = self.MAP.AIC
self.logp = self.MAP.logp
self.logp_at_max = self.MAP.logp_at_max
return self.AIC, self.BIC
So, when we calculate the BIC and AIC using this model, we get very large values (since there are lots of points). This makes total sense. However, this disfavors having many data points, which irks me. Plus, the large AIC and BIC would make a casual observer believe that the other model (which fits poorly as a result of the outliers) is actually the better model .
Am I missing a subtlety of the BIC and AIC here, or is a harsh reality of using mixture models that you always have to use a bunch of extra binary parameters to denote the membership of your datapoints?
I recommend the book "Introduction to statistical learning"
On page 212 you find the formulas for AIC and BIC. In each of these formulas the sample number is in the denominator. Hence, the result should not be influenced by the number of samples. At least not in that obvious way.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.