How to represent a categorical predictor in rstan?

What is the proper way to format a categorical predictor for use in Stan? I cannot seem to pass a categorical predictor as an ordinary factor variable, so what is the quickest way to transform an ordinary categorical variable so that Stan will accept it?

For example, say I had a continuous predictor and a categorical predictor:

your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))

Which looks like this:

     income      country
1  62085.59      England
2  60806.33      England
3  60527.27      England
4  67112.64          USA
5  57675.92          USA
6  58128.44          USA
7  60822.47 South Africa
8  55805.80 South Africa
9  63982.99 South Africa
10 64555.45      Belgium

How would I prepare this to be passed to rstan?

It is correct that Stan only accepts real or integer variables as input. In this case, you want to convert the categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like

dummy_variables <- model.matrix(~ country, data = your_dataset)

which will look like this:

   (Intercept) countryEngland countrySouth Africa countryUSA
1            1              1                   0          0
2            1              1                   0          0
3            1              1                   0          0
4            1              0                   0          1
5            1              0                   0          1
6            1              0                   0          1
7            1              0                   1          0
8            1              0                   1          0
9            1              0                   1          0
10           1              0                   0          0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"

However, that might not come out to the right number of observations if you have unmodeled missingness on some other variables, since by default rows are dropped only for missing values in the variables named in the formula. This approach can be taken a step further by passing the entire model formula, like

X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)

Now you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N,K]  X;
  vector[N]    y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(X * beta, sigma); // likelihood
  // priors
}

Using a design matrix is recommended because it makes your .stan program reusable with different variations of the same model, or even with different datasets.
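As a rough sketch of how the data for the design-matrix program could be assembled in R: the example below reuses the dataset from the question and treats income as the outcome, which is an assumption made only for illustration. One income value is set to NA, also purely for illustration, to show one way of keeping X and y aligned when rows are dropped. The name stan_data is hypothetical.

```r
# Dataset from the question; the third income is set to NA for illustration only
your_dataset <- data.frame(
  income = c(62085.59, 60806.33, NA, 67112.64, 57675.92,
             58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
  country = c("England", "England", "England", "USA", "USA", "USA",
              "South Africa", "South Africa", "South Africa", "Belgium")
)

# Build the model frame once; na.omit drops the row with the missing income
mf <- model.frame(income ~ country, data = your_dataset, na.action = na.omit)

# Derive X and y from the same model frame so their rows stay aligned
X <- model.matrix(attr(mf, "terms"), data = mf)
y <- as.vector(model.response(mf))

# Element names match the data block of the design-matrix Stan program
stan_data <- list(N = nrow(X), K = ncol(X), X = X, y = y)
```

The resulting list could then be passed as the data argument when fitting with rstan. Because X already contains the intercept column, the Stan program needs no separate intercept parameter.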

Another approach is to use an index variable, in which case the Stan program would look like

data {
  int<lower = 1> N; // observations
  int<lower = 1> J; // levels
  int<lower = 1, upper = J> x[N]; // level index per observation (newer Stan syntax: array[N] int<lower = 1, upper = J> x;)
  vector[N] y;      // outcomes
}
parameters {
  vector[J] beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(beta[x], sigma); // likelihood
  // priors 
}

and you would pass the data from R to Stan like

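list(N = nrow(my_dataset),
     J = nlevels(my_dataset$x),    # x must be a factor; nlevels() of a character vector is 0
     x = as.integer(my_dataset$x), # integer codes 1..J, one per observation
     y = my_dataset$y)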
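Applied to the dataset from the question, the index-variable data could be prepared roughly as follows. This is a sketch that assumes income is the outcome and country is the indexed predictor; the names country_f and stan_data are hypothetical. The explicit factor() call matters because, since R 4.0, data.frame() keeps strings as character vectors by default.

```r
# Dataset from the question
your_dataset <- data.frame(
  income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92,
             58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
  country = c("England", "England", "England", "USA", "USA", "USA",
              "South Africa", "South Africa", "South Africa", "Belgium")
)

# Convert explicitly to a factor so that levels and integer codes are defined
country_f <- factor(your_dataset$country)

# Element names match the data block of the index-variable Stan program
stan_data <- list(
  N = nrow(your_dataset),
  J = nlevels(country_f),
  x = as.integer(country_f),
  y = your_dataset$income
)

# Keep the levels to map each element of beta back to a country name
country_levels <- levels(country_f)
```

Here beta[j] in the Stan program corresponds to country_levels[j], so each level gets its own mean rather than an offset from a reference category.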
