简体   繁体   中英

How to represent a categorical predictor rstan?

What is the proper way to format a categorical predictor to use in STAN? I cannot seem to input a categorical predictor as a normal factor variable, so what is the quickest way to transform a normal categorical variable such that Stan can accept it?

For example, say I had aa continue predictor and a categorical predictor

your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))

Which looks like this:

     income      country
1  62085.59      England
2  60806.33      England
3  60527.27      England
4  67112.64          USA
5  57675.92          USA
6  58128.44          USA
7  60822.47 South Africa
8  55805.80 South Africa
9  63982.99 South Africa
10 64555.45      Belgium

How would I prepare this to be entered in rstan ?

It is correct that Stan only inputs real or integeger variables. In this case, you want to convert a categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like

dummy_variables <- model.matrix(~ country, data = your_dataset)

Which will look like this

   (Intercept) countryEngland countrySouth Africa countryUSA
1            1              1                   0          0
2            1              1                   0          0
3            1              1                   0          0
4            1              0                   0          1
5            1              0                   0          1
6            1              0                   0          1
7            1              0                   1          0
8            1              0                   1          0
9            1              0                   1          0
10           1              0                   0          0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"

However, that might not come out to the right number of observations if you have unmodeled missingness on some other variables. This approach can be taken a step farther by inputting the entire model formula like

X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)

Now, you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N,K]  X;
  vector[N]    y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(X * beta, sigma); // likelihood
  // priors
}

Utilizing a design matrix is recommended because it makes your .stan program reusable with different variations of the same model or even different datasets.

Another approach is to use an index variable, in which case the Stan program would look like

data {
  int<lower = 1> N; // observations
  int<lower = 1> J; // levels
  int<lower = 1, upper = J> x[N];
  vector[N] y;      // outcomes
}
parameters {
  vector[J] beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(beta[x], sigma); // likelihood
  // priors 
}

and you would pass the data from R to Stan like

list(N = nrow(my_dataset),
     J = nlevels(my_dataset$x),
     x = as.integer(my_dataset$x),
     y = my_dataset$y)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM