Gaussian process regression with scikit-learn

Context: in Gaussian Process (GP) regression we can use two approaches:

(I) Fit the kernel parameters via Maximum Likelihood (maximize data likelihood) and use the GP defined by these parameters for prediction.

(II) Bayesian approach: put a parametric prior distribution on the kernel parameters. The parameters of this prior distribution are called the hyperparameters. Condition on the data to obtain a posterior distribution for the kernel parameters and now either

(IIa) fit the kernel parameters by maximizing their posterior density (the MAP parameters) and use the GP defined by the MAP parameters for prediction, or

(IIb) (the full Bayesian approach): predict using the mixture model obtained by integrating the GPs defined by all admissible kernel parameters over the posterior distribution of the kernel parameters.

(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
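
For concreteness, the three approaches can be sketched in standard notation (not the package's notation; here θ denotes the kernel parameters, η the hyperparameters of the prior p(θ | η), (X, y) the training data and (x_*, y_*) a test point):

    \text{(I)}\quad \hat\theta = \arg\max_\theta\, p(y \mid X, \theta), \qquad \text{predict with } p(y_* \mid x_*, X, y, \hat\theta)

    \text{(IIa)}\quad \hat\theta = \arg\max_\theta\, p(y \mid X, \theta)\, p(\theta \mid \eta), \qquad \text{predict as in (I)}

    \text{(IIb)}\quad p(y_* \mid x_*, X, y, \eta) = \int p(y_* \mid x_*, X, y, \theta)\, p(\theta \mid X, y, \eta)\, \mathrm{d}\theta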

The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior distribution on kernel parameters.

Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g. here, where it is stated that "Kernels are parameterized by a vector of hyperparameters".

This must be interpreted as a sort of indirect parameterization via conditioning on the data, as the hyperparameters do not directly determine the kernel parameters. Then an example is given of the exponential kernel and its length-scale parameter. That is definitely not a hyperparameter in the sense in which the term is generally used.

No distinction seems to be drawn between kernel parameters and hyperparameters. This is confusing, and it is now unclear if the package uses the Bayesian approach at all. For example, where do we specify the parametric family of prior distributions on the kernel parameters?

Question: does scikit-learn use approach (I) or (II)?

Here is my own tentative answer: the confusion comes from the fact that a Gaussian process is often called a "prior on functions", indicating some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting to the finite data dimensions is some sort of "marginalization". This is also confusing, since in general you have marginalization only in the Bayesian approach, where you have a joint distribution of data and parameters and often marginalize out one or the other.
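
For reference, the "marginalization" in the GP sense is just the finite-dimensional marginal of the process, and involves no prior over the kernel parameters:

    f \sim \mathcal{GP}(m, k) \;\Longrightarrow\; \bigl(f(x_1), \dots, f(x_n)\bigr) \sim \mathcal{N}\bigl(m(X), K\bigr), \qquad K_{ij} = k(x_i, x_j)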

The correct view here, however, is the following: the Gaussian process is the model; the kernel parameters are the model parameters; in scikit-learn there are no hyperparameters, since there is no prior distribution on the kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters; and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I) and not (II).
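
This reading matches the standard LML expression for a zero-mean GP with noise variance σ² (cf. [RW2006]): the "marginal" refers to the latent function values f being integrated out, not to the kernel parameters:

    \log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K_\theta^{-1} y - \tfrac{1}{2} \log\lvert K_\theta \rvert - \tfrac{n}{2} \log 2\pi, \qquad K_\theta = k_\theta(X, X) + \sigma^2 I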

If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel's parameters which maximize the log-marginal likelihood." In your question that is approach (I).
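
To make this concrete, here is a minimal sketch (the synthetic data are made up for illustration; the calls are the standard GaussianProcessRegressor interface):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    # Synthetic 1-D training data (for illustration only)
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, 20)[:, np.newaxis]
    y = np.sin(X).ravel() + 0.1 * rng.randn(20)

    # Kernel with initial parameter values; the bounds define the search space
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)

    # fit() maximizes the log-marginal likelihood over the kernel parameters,
    # restarting the optimizer from 5 random initial values -- approach (I)
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.1 ** 2,
                                   n_restarts_optimizer=5).fit(X, y)

    print(gpr.kernel_)  # the kernel with maximum-likelihood parameters
    print(gpr.log_marginal_likelihood(gpr.kernel_.theta))  # LML at the optimum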

I would note two more things though:

  1. In my mind, the fact that they are called "hyperparameters" automatically implies that they are deterministic and can be estimated directly. Otherwise, they are random variables and that is why they can have a distribution. Another way to think of it is: did you define a prior for it? If not, then it is a parameter! If you did, then the prior's hyperparameter(s) may be what needs to be determined.
  2. Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically, it is possible to make it "fully Bayesian" (your approach (IIb)), but you must provide the inference method; a sketch follows below.
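
To illustrate that second point, here is a minimal sketch of such an external scheme: a toy random-walk Metropolis sampler over theta (the log-transformed kernel parameters), written for this example and not part of scikit-learn; it implicitly assumes a flat prior on theta:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, 20)[:, np.newaxis]
    y = np.sin(X).ravel()

    # optimizer=None keeps the kernel parameters fixed during fit()
    gpr = GaussianProcessRegressor(kernel=RBF(), optimizer=None).fit(X, y)

    # Random-walk Metropolis over theta, using the LML as the
    # (un-normalized, flat-prior) log posterior of the kernel parameters
    theta = gpr.kernel_.theta.copy()
    lml = gpr.log_marginal_likelihood(theta)
    samples = []
    for _ in range(1000):
        proposal = theta + 0.1 * rng.randn(*theta.shape)
        lml_prop = gpr.log_marginal_likelihood(proposal)
        if np.log(rng.rand()) < lml_prop - lml:  # Metropolis accept/reject
            theta, lml = proposal, lml_prop
        samples.append(theta)

    # `samples` approximates the posterior over the kernel parameters; a fully
    # Bayesian prediction (approach (IIb)) would average the GP predictive
    # distributions over these samples rather than plugging in a single theta.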
