Gaussian-Process Prediction Confidence Interval Oddities

I'm doing some particle physics analysis and was hoping someone could give me some insight into a Gaussian-process fit I'm trying to use to extrapolate some data.

I have data with uncertainties that I'm feeding into the scikit-learn GaussianProcess algorithm. I'm including the uncertainties via the "nugget" argument (my implementation matches a standard example in which "corr" is squared exponential and the "nugget" values are set to (dy/y)**2). The main concern is this: I have low absolute uncertainty (but high fractional uncertainty) at the edges of the distribution, and this is producing a predicted confidence interval much larger than I expect in this region (see plot below).

[Figure: data points and GP regression]

The reason the uncertainties behave this way is that I'm dealing with particle physics data, which is a histogram of counts of particles observed with different feature (x) values. These counts follow a Poisson distribution and thus have an uncertainty (standard deviation) of sqrt(N). So the higher-count regions of the distribution have higher absolute but lower fractional uncertainty, and vice versa for the low-count regions.
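To make the absolute-versus-fractional point concrete, here is a minimal sketch. The bin counts are made up for illustration, and `poisson_uncertainties` is a hypothetical helper name, not something from scikit-learn.

```python
import math

def poisson_uncertainties(counts):
    """Return (absolute, fractional) Poisson uncertainty for each non-zero count."""
    result = []
    for n in counts:
        abs_unc = math.sqrt(n)    # Poisson standard deviation
        frac_unc = abs_unc / n    # equals 1 / sqrt(n) for n > 0
        result.append((abs_unc, frac_unc))
    return result

# Hypothetical bin counts: a peak in the middle, sparse tails at the edges.
counts = [1, 4, 100, 400, 100, 4, 1]
for n, (abs_unc, frac_unc) in zip(counts, poisson_uncertainties(counts)):
    print(f"N={n:4d}  abs={abs_unc:6.2f}  frac={frac_unc:5.3f}")
```

The peak bin (N=400) has the largest absolute uncertainty but a fractional uncertainty of only 5%, while a bin with a single count has a fractional uncertainty of 100%.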

I understand, as I mentioned, that the "nugget" argument in this function should have values of (fractional uncertainty)**2 when working with a squared exponential kernel. So it makes sense that if the predicted uncertainty is based on the fractional uncertainty of the input, it could be large at the edges. But I don't completely understand how this plays out in the math, and the predicted uncertainty is so much larger than the data point uncertainties at the edges that it seems wrong to me.
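One way to see how a fractional nugget could turn into a large absolute interval is to write the posterior variance out by hand. The sketch below is not the scikit-learn implementation. It assumes a zero-mean GP with fixed hyperparameters, a squared-exponential correlation, and a nugget added to the diagonal of the training correlation matrix, so that the predicted variance is `process_std**2 * (1 - r^T R^{-1} r)`. The names `sq_exp_corr` and `predictive_std`, the length scale, and the counts are all hypothetical.

```python
import numpy as np

def sq_exp_corr(a, b, length=1.0):
    """Squared-exponential correlation between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def predictive_std(x_train, nugget, x_test, process_std=1.0, length=1.0):
    """Posterior std of a zero-mean GP with `nugget` added to the diagonal
    of the training correlation matrix (hyperparameters held fixed)."""
    R = sq_exp_corr(x_train, x_train, length) + np.diag(nugget)
    r = sq_exp_corr(x_train, x_test, length)
    # Fraction of the process variance explained by the data: r^T R^{-1} r
    explained = np.sum(r * np.linalg.solve(R, r), axis=0)
    return process_std * np.sqrt(np.clip(1.0 - explained, 0.0, None))

# Hypothetical histogram: a peak in the middle, sparse tails at the edges.
x = np.arange(7.0)
counts = np.array([1.0, 4.0, 100.0, 400.0, 100.0, 4.0, 1.0])
nugget = 1.0 / counts                     # (sqrt(N) / N)**2
# Assumption: the overall process scale is set by the spread of all counts.
gp_std = predictive_std(x, nugget, x, process_std=counts.std())

for xi, n, s in zip(x, counts, gp_std):
    print(f"x={xi:.0f}  N={n:5.0f}  sqrt(N)={np.sqrt(n):6.2f}  GP std={s:7.2f}")
```

Under these assumptions the nugget is measured relative to a single global process variance, which is dominated by the peak, rather than relative to the local count. A nugget near 1.0 in an edge bin then means noise comparable to the whole distribution's spread, so the edge bins come out with a predicted standard deviation far larger than their sqrt(N).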

Can anyone comment on what's going on here? Is this behaving as expected? If so, why? Any thoughts or references to further reading on the subject would be greatly appreciated!

I'll leave you with a couple of important caveats:

1) There are several data points with zero counts at the edges of the distribution. This throws a kink into the fractional uncertainty for the "nugget", because (sqrt(0)/0)**2 is undefined. I made an adjustment here of just setting the nugget value for these points to 1.0, which corresponds to the value you get for a count of 1. I believe this is a common approximation which does affect the question at hand, but I don't think it fundamentally changes the issue.

2) The data I'm working with is actually a 2D histogram (i.e., one independent variable (let's say x), another (y), and the counts as the dependent variable (z)). The plot shown is a 1D slice of the 2D data and prediction (i.e., z vs. x integrated over a small range of y). I don't think this really affects the question at hand, but I thought I'd mention it.
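The zero-count adjustment from caveat 1 can be sketched as follows. Since (sqrt(N)/N)**2 simplifies to 1/N, the nugget for a bin with one count is 1.0, which is the value used as the fill for empty bins. The helper name `poisson_nugget` and its `zero_fill` parameter are hypothetical.

```python
def poisson_nugget(counts, zero_fill=1.0):
    """Squared fractional Poisson uncertainty per bin, with a fill value
    for empty bins where sqrt(0)/0 is undefined."""
    # (sqrt(N) / N)**2 simplifies to 1 / N for N > 0.
    return [1.0 / n if n > 0 else zero_fill for n in counts]

# Hypothetical counts with empty bins at the edges.
counts = [0, 1, 4, 100, 4, 0]
print(poisson_nugget(counts))
```

The resulting list could then be passed as the per-point "nugget" values described above.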

From your presentation, I suspect the behavior is correct, though I have not stepped through the math. My instinct is telling me: don't use a uniform histogram. Make the bins wider as you move away from the center of the distribution. That will increase the counts per bin and decrease your fractional errors.
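One possible way to get wider bins in the tails is to merge adjacent bins until each merged bin holds a minimum number of counts. This is only a sketch of that idea for a 1D histogram, not necessarily what the answer had in mind. The function name `merge_sparse_bins`, the threshold, and the example data are hypothetical.

```python
def merge_sparse_bins(edges, counts, min_count=10):
    """Greedily merge adjacent bins, left to right, until each merged bin
    holds at least `min_count` entries. Returns (new_edges, new_counts)."""
    new_edges = [edges[0]]
    new_counts = []
    running = 0
    for i, n in enumerate(counts):
        running += n
        if running >= min_count:
            new_edges.append(edges[i + 1])   # close the merged bin here
            new_counts.append(running)
            running = 0
    # Fold any leftover bins on the right into the last merged bin.
    if new_edges[-1] != edges[-1]:
        if new_counts:
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts

# Hypothetical uniform histogram with sparse tails.
edges = [0, 1, 2, 3, 4, 5, 6, 7, 8]
counts = [0, 1, 3, 40, 90, 35, 2, 1]
new_edges, new_counts = merge_sparse_bins(edges, counts, min_count=4)
print(new_edges)   # [0, 3, 4, 5, 8]
print(new_counts)  # [4, 40, 90, 38]
```

With unequal bin widths you would probably want to fit counts divided by bin width rather than raw counts, so that the wider tail bins do not look like artificial bumps.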
