
Estimating parameters using stan when the distribution for response variable in a regression is non-normal - Part 2

This is a follow-up to my previous post, "Estimating parameters using stan when the distribution for response variable in a regression is non-normal".

Let's say I have the data below:

dat = list(y = c(0.00792354094929414, 0.00865300734292492, 0.0297400780486734, 
0.0196358416326437, 0.00239020640762042, 0.0258055591736283, 
0.17394835142698, 0.156463554455613, 0.329388185725557, 0.00764435088817635, 
0.0162081480398152, 0, 0.00157591399416963, 0.420025972703085, 
0.000122623651944455, 0.133061480234834, 0.565454216154227, 0.000281973481299731, 
0.000559715156383041, 0.0270686389659072, 0.918300537689865, 
0.00000782624683025728, 0.00732414341919458, 0, 0, 0, 0, 0, 0, 
0, 0.174071274611405, 0.0432109713717948, 0.0544400838264943, 
0, 0.0907049925221286, 0.616680102647887, 0, 0), x = c(23.8187587698947, 
15.9991138359515, 33.6495930512881, 28.555818797764, -52.2967967248258, 
-91.3835208788233, -73.9830692708321, -5.16901145289629, 29.8363012310241, 
10.6820057903939, 19.4868517164395, 15.4499668436458, -17.0441644773509, 
10.7025053739577, -8.6382953428539, -32.8892974839165, -15.8671863161348, 
-11.237248036145, -7.37978020066205, -3.33500586334862, -4.02629933182873, 
-20.2413384726948, -54.9094885578775, -48.041459120976, -52.3125732905322, 
-35.6269065970458, -62.0296155423529, -49.0825017152659, -73.0574478287598, 
-50.9409090127938, -63.4650928035253, -55.1263264283842, -52.2841103768755, 
-61.2275334149805, -74.2175990067417, -68.2961107804698, -76.6834643609286, 
-70.16769103228), N = 38)

I want to fit a logit model to the data above, treating y as a fractional response variable. My Stan model code is below:

library(rstan)

model = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

transformed data {
  vector[N] z = bernoulli_rng(y);
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

transformed parameters {
    vector[N] mu;
    mu = alpha + beta * x;
}

model {
  sigma ~ normal(0, 1);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  z ~ bernoulli(mu);
}

"
sampling(stan_model(model_code = model), data = dat, chains = 4, iter = 50000, refresh = 0)

With this, I get the following error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable definition base type mismatch, variable declared as base type vector variable definition has base type int[ ] error in 'model93e37bdec88_3b62e3bb17b9f3ed9c717c98aa6ca9ac' at line 9, column 32
  -------------------------------------------------
     7: 
     8: transformed data {
     9:   vector[N] z = bernoulli_rng(y);
                                       ^
    10: }
  -------------------------------------------------

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'sampling': failed to parse Stan model '3b62e3bb17b9f3ed9c717c98aa6ca9ac' due to the above error.

Could you please help me find the correct specification of the Stan model?
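On the parser error itself: the message says that `z` is declared as a `vector` while `bernoulli_rng(y)` returns an array of integers (`int[ ]`), so the declaration has to be an integer array. Separately, `bernoulli()` expects a probability in [0, 1], whereas `alpha + beta * x` is unbounded; `bernoulli_logit()` takes the linear predictor on the logit scale directly. Below is a minimal sketch under those two changes, not a recommendation of the model. The name `model_fixed` is made up for illustration, the unused `sigma` is dropped, and the array declaration uses the older Stan syntax that matches the parser shown in the error (recent Stan releases write it as `array[N] int`).

```r
# Sketch only: same structure as the model in the question, with the type
# mismatch and the link function fixed. `model_fixed` is a hypothetical name.
model_fixed <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector<lower=0, upper=1>[N] y;
}

transformed data {
  // bernoulli_rng() returns integers, so z must be an integer array.
  // Recent Stan releases spell this: array[N] int<lower=0, upper=1> z
  int<lower=0, upper=1> z[N] = bernoulli_rng(y);
}

parameters {
  real alpha;
  real beta;
}

model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  // alpha + beta * x is on the logit scale, hence bernoulli_logit().
  z ~ bernoulli_logit(alpha + beta * x);
}
"

# To fit, assuming rstan is installed and `dat` is the list from the question:
# fit <- rstan::sampling(rstan::stan_model(model_code = model_fixed),
#                        data = dat, chains = 4, refresh = 0)
```

Note that `z` is drawn once in `transformed data`, so the fit depends on that single random binarization of `y`.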

There might be a deeper issue than how to model saturated probabilities (probabilities that are either exactly 0 or exactly 1).

Here is a plot of your data. Visually, there isn't much of a relationship between x and y.

library("tidyverse")

as_tibble(dat) %>%
  ggplot(
    aes(x, y)
  ) +
  geom_point() +
  scale_y_continuous(
    limits = c(0, 1)
  )

Created on 2022-03-13 by the reprex package (v2.0.1)

And things don't get better on the logit scale, i.e., with the transformation z = logit(y).

library("tidyverse")

as_tibble(dat) %>%
  # The transformation maps the saturated probabilities to -Inf (y = 0) or Inf (y = 1).
  mutate(
    z = qlogis(y)
  ) %>%
  # Those points have no finite position on the logit scale; only the finite ones are informative.
  ggplot(
    aes(x, z)
  ) +
  geom_point()

Created on 2022-03-13 by the reprex package (v2.0.1)
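To see what the logit transformation does to saturated values, here is a small base R sketch. The vector `y_demo` is made up for illustration and is not taken from your data.

```r
# Hypothetical fractional responses, including both saturated values.
y_demo <- c(0, 0.25, 0.5, 1)

# qlogis() is the logit function: log(y / (1 - y)).
z_demo <- qlogis(y_demo)

# Flag which transformed values are finite; 0 and 1 map to -Inf and Inf.
finite_flag <- is.finite(z_demo)

# Keep only the observations with a finite logit.
z_finite <- z_demo[finite_flag]
```

Any analysis on the logit scale therefore has to decide what to do with exact zeros and ones before transforming.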
