How to deal with missing data in Stan?

I am new to Stan and I am implementing the probabilistic matrix factorization model.

Given a user-item rating matrix:

                       item
 user     1    3   NA   4     5    NA
          2    0    3   NA    1     5
          1    1    NA  NA    NA    0
          ....

How should I represent the observed data in the data block and the missing data to be predicted in the parameters block?

Thank you in advance!

EDIT:

I am now implementing the model as below:

pmf_code = """
data {

int<lower=0> K; //number of factors
int<lower=0> N; //number of users
int<lower=0> M; //number of items
int<lower=0> D; //number of observations
int<lower=0> D_new; //number of predictions
int<lower=0, upper=N> ii[D]; //item 
int<lower=0, upper=M> jj[D]; //user
int<lower=0, upper=N> ii_new[D_new]; // item
int<lower=0, upper=N> jj_new[D_new]; // user
real<lower=0, upper=5> r[D]; //rating
real<lower=0, upper=5> r_new[D_new]; //predicted rating

}

parameters {
row_vector[K] i[M]; // item profile
row_vector[K] u[N]; // user profile
real<lower=0> alpha;
real<lower=0> alpha_i;
real<lower=0> alpha_u;

}
transformed parameters {
matrix[N,M] I; // indicator variable
I <- rep_matrix(0, N, M);
for (d in 1:D){
    I[ii[d]][jj[d]] <- 1;
}
}
model {
for (d in 1:D){
    r[d] ~ normal(u[jj[d]]' * i[ii[d]], 1/alpha);
}

for (n in 1: N){
    u[n] ~ normal(0,(1/alpha_u) * I);
}
for (m in 1:M){
    i[m] ~ normal(0,(1/alpha_i) * I);
}
}
generated_quantities{
for (d in 1:D_new){
    r_new[d] <- normal(u[jj_new[d]]' * i[ii_new[d]], 1/alpha);
}
}
"""     

But I get a "No matches for: real ~ normal(matrix, real)" error at this statement:

for (d in 1:D){
    r[d] ~ normal(u[jj[d]]' * i[ii[d]], 1/alpha);
}

But jj[d] should be an integer denoting the id of a user, so u[jj[d]] should be a row_vector with K factors, and so should i[ii[d]]. Their product should be a single real value, so why does Stan say it is a matrix?
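On the type error itself: in Stan, transposing a row_vector gives a (column) vector, and a vector times a row_vector is an outer product, i.e. a K x K matrix, which is why the mean argument is reported as a matrix. Reversing the order (row_vector times vector, as in u[jj[d]] * i[ii[d]]') or calling dot_product(u[jj[d]], i[ii[d]]) yields a real instead. The shapes can be checked with a small Python sketch; the helper names outer and dot are illustrative only and are not part of Stan or PyStan:

```python
def outer(col, row):
    """Column vector (K x 1) times row vector (1 x K): a K x K matrix."""
    return [[c * r for r in row] for c in col]

def dot(row, col):
    """Row vector (1 x K) times column vector (K x 1): a single number."""
    return sum(r * c for r, c in zip(row, col))

u = [1.0, 2.0, 3.0]   # stands in for u[jj[d]], with K = 3
v = [0.5, 0.0, 2.0]   # stands in for i[ii[d]]

as_written = outer(u, v)   # mirrors u[jj[d]]' * i[ii[d]]  -> 3 x 3 matrix
intended = dot(u, v)       # mirrors u[jj[d]] * i[ii[d]]'  -> scalar
```

Only the second form has a type that normal() accepts as a location for a single real outcome.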

There's a chapter in the Stan manual on how to deal with missing or sparse data. In this case, it's missing data. What you want to do is put it in long form (what R's reshape package calls melted form):

  int<lower=0> I;               // number of items
  int<lower=0> J;               // number of users
  int N;                        // number of observations
  int<lower=1, upper=I> ii[N];  // item 
  int<lower=1, upper=J> jj[N];  // user
  int<lower=0, upper=5> y[N];   // rating

Then, for each observation n, you have user jj[n] assigning the rating y[n] to item ii[n].
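As a rough illustration of the melting step on the Python side, the sketch below turns a users-by-items matrix into the index arrays declared above. It assumes missing ratings are encoded as None; the function name melt_ratings and the extra keys N_new, ii_new and jj_new (index pairs of the missing cells, kept for later prediction) are hypothetical and not part of the data block above:

```python
def melt_ratings(matrix):
    """Split a users-by-items matrix (None = missing) into long form.

    Indices are 1-based because Stan arrays are 1-based.
    """
    ii, jj, y = [], [], []      # observed cells: item, user, rating
    ii_new, jj_new = [], []     # missing cells: item, user
    for user, row in enumerate(matrix, start=1):
        for item, rating in enumerate(row, start=1):
            if rating is None:
                ii_new.append(item)
                jj_new.append(user)
            else:
                ii.append(item)
                jj.append(user)
                y.append(rating)
    return {
        "I": len(matrix[0]), "J": len(matrix), "N": len(y),
        "ii": ii, "jj": jj, "y": y,
        "N_new": len(ii_new), "ii_new": ii_new, "jj_new": jj_new,
    }

# The three rows shown in the question, with NA written as None.
ratings = [
    [1, 3, None, 4, 5, None],
    [2, 0, 3, None, 1, 5],
    [1, 1, None, None, None, 0],
]
stan_data = melt_ratings(ratings)
```

Only the observed cells end up in ii, jj and y, so nothing that looks like a missing-value marker is ever passed to Stan.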

There's an example of this in the IRT models in the regression section of the manual. But you have an ordinal outcome, which is a bit trickier. You could either do a direct ordinal logistic regression of some kind, probably hierarchical, or you could try something like a factor model (like the partial SVD everyone used for Netflix). There are also examples of factor models in the manual; you'd use those to generate the linear predictor for an ordinal regression.

Then, if you want to predict y[m] for some new combination of item i and user j, you can do that in the generated quantities block as a posterior predictive quantity. You can do that either via sampling or via an expectation; there's an example of that in the change-point model in the latent discrete parameters chapter and also in the regression chapter on prediction.
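The difference between the two options can be illustrated outside Stan, given posterior draws. The sketch below assumes the Gaussian model from the question (rating mean equal to the dot product of the user and item factors) rather than an ordinal model, and the layout of draws is hypothetical, not a PyStan output format:

```python
import random

def linear_predictor(draw, user, item):
    """Dot product of the user and item factor vectors in one draw."""
    return sum(a * b for a, b in zip(draw["u"][user], draw["i"][item]))

def predict_expectation(draws, user, item):
    """Posterior mean of the predicted rating, averaged over draws."""
    means = [linear_predictor(d, user, item) for d in draws]
    return sum(means) / len(means)

def predict_by_sampling(draws, user, item, rng):
    """One simulated rating per posterior draw (posterior predictive)."""
    return [rng.gauss(linear_predictor(d, user, item), d["sigma"])
            for d in draws]

# Two made-up draws for one user and one item with K = 2 factors.
draws = [
    {"u": [[1.0, 2.0]], "i": [[0.5, 1.0]], "sigma": 0.5},
    {"u": [[1.5, 1.0]], "i": [[1.0, 2.0]], "sigma": 0.5},
]
expected = predict_expectation(draws, user=0, item=0)
simulated = predict_by_sampling(draws, user=0, item=0, rng=random.Random(1))
```

The expectation gives a single point prediction, while the simulated values also carry the observation noise and can be summarized into predictive intervals.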

Stan has neither a missing data symbol nor the ability to estimate discrete unknowns, so what you are proposing is almost impossible and not a great entry point for learning Stan. This is explained in the Stan User Manual.

In principle, you could pass in the non-missing data and a two-dimensional integer array that is 0 if the item is missing for a user and 1 if the item is observed for that user. Then you need to declare a latent utility for each user and item, constrain it to fall between the right two cutpoints if the data point is observed, and adjust for the absolute value of the derivative of the transformation you use to get the latent utility between the cutpoints. If the data point is missing, the corresponding latent utility is unconstrained. This is essentially the data augmentation approach used by Gibbs samplers, although Stan is not a Gibbs sampler. Then you specify your model for the latent utilities (constraining the scale of the errors to be 1) and hope for the best. Most likely there will be a lot of divergent transitions, which will require that you set adapt_delta quite close to 1 to eliminate them.
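A minimal sketch of preparing that indicator array in Python follows. It assumes missing ratings are encoded as None; the function name observed_indicator and the choice of 0 as a placeholder for missing cells are assumptions made for illustration, not something prescribed by Stan:

```python
def observed_indicator(matrix):
    """Return (filled, observed) for a users-by-items matrix with None gaps.

    observed[u][i] is 1 where a rating exists and 0 where it is missing.
    filled replaces each gap with a placeholder (0 here), since Stan data
    cannot hold a missing-value symbol; the model must ignore the
    placeholder wherever observed is 0.
    """
    observed = [[0 if r is None else 1 for r in row] for row in matrix]
    filled = [[0 if r is None else r for r in row] for row in matrix]
    return filled, observed

# The three rows shown in the question, with NA written as None.
ratings = [
    [1, 3, None, 4, 5, None],
    [2, 0, 3, None, 1, 5],
    [1, 1, None, None, None, 0],
]
filled, observed = observed_indicator(ratings)
```

If the model is run through PyStan 2, adapt_delta is passed via the control argument of sampling, for example control={'adapt_delta': 0.99}; other interfaces spell this differently.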

The closest thing we have to an example of this approach is a multivariate probit model, but that is for the simpler case of binary outcomes.
