Optimizing Gaussian Process in Stan / rstan

I recently encountered Gaussian process models and think they may be the solution to a problem I have been working on in my lab. I have an open, related question on Cross Validated, but I wanted to separate my modeling/math questions from my programming questions. Hence this second, related post. If more background on my problem would help, here is the link to my open CV question.

Here is my Stan code that corresponds to the updated covariance functions presented in my CV post:

    functions {
        //covariance function for main portion of the model
        matrix main_GP(
            int Nx,
            vector x,
            int Ny,
            vector y,
            real alpha1,
            real alpha2,
            real alpha3,
            real rho1,
            real rho2,
            real rho3,
            real rho4,
            real rho5,
            real HR_f,
            real R_f){
            matrix[Nx, Ny] K1;
            matrix[Nx, Ny] K2;
            matrix[Nx, Ny] K3;
            matrix[Nx, Ny] Sigma;

            //specifying random Gaussian process that governs covariance matrix
            for(i in 1:Nx){
                for (j in 1:Ny){
                    K1[i,j] = alpha1*exp(-square(x[i]-y[j])/2/square(rho1));
                }
            }

            //specifying random Gaussian process incorporates heart rate
            for(i in 1:Nx){
                for(j in 1:Ny){
                    K2[i, j] = alpha2*exp(-2*square(sin(pi()*fabs(x[i]-y[j])*HR_f))/square(rho2))*
                    //[second factor of this product is missing from the listing as posted]
                }
            }

            //specifying random Gaussian process incorporates heart rate as a function of respiration
            for(i in 1:Nx){
                for(j in 1:Ny){
                    K3[i, j] = alpha3*exp(-2*square(sin(pi()*fabs(x[i]-y[j])*HR_f))/square(rho4))*
                    //[second factor of this product is missing from the listing as posted]
                }
            }

            Sigma = K1+K2+K3;
            return Sigma;
        }

        //function for posterior calculations
        vector post_pred_rng(
            real a1,
            real a2,
            real a3,
            real r1,
            real r2,
            real r3,
            real r4,
            real r5,
            real HR,
            real R,
            real sn,
            int No,
            vector xo,
            int Np,
            vector xp,
            vector yobs){
            matrix[No,No] Ko;
            matrix[Np,Np] Kp;
            matrix[No,Np] Kop;
            matrix[Np,No] Ko_inv_t;
            vector[Np] mu_p;
            matrix[Np,Np] Tau;
            matrix[Np,Np] L2;
            vector[Np] yp;

            //Kernel Multiple GPs for observed data
            Ko = main_GP(No, xo, No, xo, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
            Ko = Ko + diag_matrix(rep_vector(1, No))*sn;

            //kernel for predicted data
            Kp = main_GP(Np, xp, Np, xp, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
            Kp = Kp + diag_matrix(rep_vector(1, Np))*sn;

            //kernel for observed and predicted cross
            Kop = main_GP(No, xo, Np, xp, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);

            //Algorithm 2.1 of Rasmussen and Williams...
            Ko_inv_t = Kop'/Ko;
            mu_p = Ko_inv_t*yobs;
            //predictive covariance; this assignment is absent from the listing as posted and is
            //reconstructed here from the standard GP predictive equations
            Tau = Kp - Ko_inv_t*Kop;
            L2 = cholesky_decompose(Tau);
            //note: rep_vector(normal_rng(0,1), Np) repeats a single standard normal draw Np times
            yp = mu_p + L2*rep_vector(normal_rng(0,1), Np);
            return yp;
        }
    }

    data {
        int<lower=1> N1;
        int<lower=1> N2;
        vector[N1] X;
        vector[N1] Y;
        vector[N2] Xp;
        real<lower=0> mu_HR;
        real<lower=0> mu_R;
    }

    transformed data {
        vector[N1] mu;
        for(n in 1:N1) mu[n] = 0;
    }

    parameters {
        real loga1;
        real loga2;
        real loga3;
        real logr1;
        real logr2;
        real logr3;
        real logr4;
        real logr5;
        real<lower=.5, upper=3.5> HR;
        real<lower=.05, upper=.75> R;
        real logsigma_sq;
    }

    transformed parameters {
        real a1 = exp(loga1);
        real a2 = exp(loga2);
        real a3 = exp(loga3);
        real r1 = exp(logr1);
        real r2 = exp(logr2);
        real r3 = exp(logr3);
        real r4 = exp(logr4);
        real r5 = exp(logr5);
        real sigma_sq = exp(logsigma_sq);
    }

    model {
        matrix[N1,N1] Sigma;
        matrix[N1,N1] L_S;

        //using GP function from above
        Sigma = main_GP(N1, X, N1, X, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
        Sigma = Sigma + diag_matrix(rep_vector(1, N1))*sigma_sq;

        L_S = cholesky_decompose(Sigma);
        Y ~ multi_normal_cholesky(mu, L_S);

        //priors for parameters
        loga1 ~ student_t(3,0,1);
        loga2 ~ student_t(3,0,1);
        loga3 ~ student_t(3,0,1);
        logr1 ~ student_t(3,0,1);
        logr2 ~ student_t(3,0,1);
        logr3 ~ student_t(3,0,1);
        logr4 ~ student_t(3,0,1);
        logr5 ~ student_t(3,0,1);
        logsigma_sq ~ student_t(3,0,1);
        HR ~ normal(mu_HR,.25);
        R ~ normal(mu_R, .03);
    }

    generated quantities {
        vector[N2] Ypred;
        Ypred = post_pred_rng(a1, a2, a3, r1, r2, r3, r4, r5, HR, R, sigma_sq, N1, X, N2, Xp, Y);
    }

I have tinkered with the priors for the parameters in my kernels. Some parameterizations are somewhat faster (up to twice as fast in some instances), but even those can still produce relatively slow chains.

I am trying to predict values for 3.5 s worth of data (sampled at 10 Hz, so 35 data points), using data from the 15 seconds preceding and the 15 seconds following the contaminated section (sampled at 3.33 Hz, so 100 data points in total).

The model as specified in R is as follows:

    fit.pred2 <- stan(file = 'Fast_GP6_all.stan',
                      data = dat,
                      warmup = 1000,
                      iter = 1500,
                      chains = 3,
                      pars = pars.to.monitor)

To be honest, I do not know if I need that many warmup iterations. I imagine part of the slow estimation is the result of fairly non-informative priors (except for heart rate and respiration, HR and R, as those have fairly well-known ranges at rest in a healthy adult).

Any suggestions to speed up my program's run time are more than welcome.

Thanks.

You could grab the develop branch of the Stan Math Library, which has a recently updated version of multi_normal_cholesky that uses analytic gradients internally instead of autodiff. To do so, you can execute in R source("https://raw.githubusercontent.com/stan-dev/rstan/develop/install_StanHeaders.R") but you need to have CXXFLAGS+=-std=c++11 in your ~/.R/Makevars file and may need to reinstall the rstan package afterward.

Both multi_normal_cholesky and your main_GP are O(N^3), so your Stan program is never going to be especially fast, but incremental optimizations of those two functions are going to make the biggest difference.
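As one possible illustration of an incremental optimization to the kernel code, the sketch below covers only the squared-exponential term (K1 in the question) for the case where both inputs are the same vector. The helper name se_cov_sym is hypothetical and not part of the original program. It hoists the constant out of the loops and fills both triangles from a single evaluation; whether this gives a noticeable speedup for this model is untested.

```stan
functions {
  // Hypothetical helper (not from the original post): squared-exponential
  // kernel alpha * exp(-(x_i - x_j)^2 / (2 * rho^2)) for the symmetric case
  // where the row and column inputs are the same vector.
  matrix se_cov_sym(int N, vector x, real alpha, real rho) {
    matrix[N, N] K;
    real inv_two_rho_sq = 0.5 / square(rho);  // computed once, not N^2 times

    for (i in 1:N) {
      K[i, i] = alpha;  // exp(0) is 1 on the diagonal
      for (j in (i + 1):N) {
        real d = x[i] - x[j];
        K[i, j] = alpha * exp(-square(d) * inv_two_rho_sq);
        K[j, i] = K[i, j];  // reuse the value for the mirrored entry
      }
    }
    return K;
  }
}
```

The same pattern (compute the difference x[i] - x[j] once and reuse it) would apply to the periodic terms, which all depend on that difference.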

Beyond that, there are some small things, like Sigma = Sigma + diag_matrix(rep_vector(1, N1))*sigma_sq; which should be changed to for (n in 1:N1) Sigma[n,n] += sigma_sq; The reason is that multiplying sigma_sq by a diagonal matrix puts N1 squared nodes onto the autodiff tree, as does adding the result to Sigma, which involves a lot of memory allocation and deallocation. The explicit loop along the diagonal only puts N1 nodes onto the autodiff tree, or maybe it just updates the existing tree if we are clever with the += operator. The same applies inside your post_pred_rng function, although that is less critical because the generated quantities block is evaluated with doubles rather than the custom Stan type for reverse-mode autodiff.
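The diagonal update described above can be sketched in a small stand-alone program. This is not the model from the question: it assumes a single squared-exponential kernel, and the names N, x, y, alpha, rho and the priors are placeholders chosen only to keep the example short.

```stan
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
  real<lower=0> sigma_sq;
}
model {
  matrix[N, N] Sigma;

  // Simplified kernel, assumed for illustration only
  for (i in 1:N) {
    for (j in 1:N) {
      Sigma[i, j] = alpha * exp(-square(x[i] - x[j]) / 2 / square(rho));
    }
  }

  // Add the noise variance along the diagonal only: N updates,
  // with no N x N diag_matrix temporary and no full matrix addition
  for (n in 1:N) {
    Sigma[n, n] += sigma_sq;
  }

  y ~ multi_normal_cholesky(rep_vector(0, N), cholesky_decompose(Sigma));

  // Placeholder priors, assumed for illustration only
  alpha ~ normal(0, 1);
  rho ~ lognormal(0, 1);
  sigma_sq ~ normal(0, 1);
}
```

In the program from the question, the equivalent change is the single loop over 1:N1 in place of the diag_matrix line, with everything else left as it is.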

I think / hope that declaring and assigning in one statement, vector[N2] Ypred = post_pred_rng(...); is slightly faster than the two-statement form, vector[N2] Ypred; followed by Ypred = post_pred_rng(...); because the latter preallocates Ypred with NaNs before assigning to it. In any event, the one-statement form is nicer to read.

Finally, while it does not affect the speed, you are not obligated to specify your parameters in log form and then antilog them in the transformed parameters block. You can just declare things with <lower=0> and that will result in the same thing, although then you would be specifying your priors on the positively constrained things rather than the unconstrained things. In most cases, that is more intuitive. Those Student t priors with 3 degrees of freedom are very heavy-tailed, which may cause Stan to build deep trees (up to its default treedepth limit of 10), at least during warmup. The treedepth (s) is the main contributor to the runtime, since each iteration requires up to 2^s - 1 leapfrog steps, each of which needs a function / gradient evaluation. At the default limit of 10, that is up to 1023 evaluations per iteration.
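The <lower=0> declarations mentioned above might look like the following sketch. Only three of the nine positive parameters are shown, and the priors are assumptions chosen for illustration, not a recommendation from the original answer. One relationship that is certain: a lognormal(0, 1) prior on r1 is the same as a normal(0, 1) prior on log(r1), so it is a lighter-tailed counterpart to student_t(3, 0, 1) on the log scale.

```stan
parameters {
  real<lower=0> a1;        // replaces loga1 and a1 = exp(loga1)
  real<lower=0> r1;        // replaces logr1 and r1 = exp(logr1)
  real<lower=0> sigma_sq;  // replaces logsigma_sq and its exp()
}
model {
  // Priors are now stated on the positive scale.
  // All three choices below are assumed, for illustration only.
  a1 ~ normal(0, 1);        // half-normal because of the lower bound
  r1 ~ lognormal(0, 1);     // equivalent to log(r1) ~ normal(0, 1)
  sigma_sq ~ normal(0, 1);  // half-normal because of the lower bound
}
```

With this form the transformed parameters block is no longer needed for these quantities, and Stan handles the change of variables to the unconstrained scale internally.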

