
probability of linear trend

I have a small sample ([10 16 11 16 26 17 16 16 15 13 15 14 12 12 14 20 14 12 16 21 13 13 14 16 17 18 16 14 16 23 24 12 13 13 15 16 15 14 14 16 20 17 17 15 23 18 12 19 12 11 19 17 14 18 15 23 30 24 16 14 22 17 17 17 17 20 19 27 17 36]):

There are two models:

  • Model A – there is no linear trend, so the noise is the deviation of each data point from the mean of the data.
  • Model B – there is a linear trend, so the noise is the deviation of each data point from a fitted linear trendline.
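A minimal sketch of how the two residual series could be computed (assuming numpy; the names x, residual_a and residual_b are mine, not from the original code):

import numpy as np

data = np.array([10, 16, 11, 16, 26, 17, 16, 16, 15, 13, 15, 14, 12, 12,
                 14, 20, 14, 12, 16, 21, 13, 13, 14, 16, 17, 18, 16, 14,
                 16, 23, 24, 12, 13, 13, 15, 16, 15, 14, 14, 16, 20, 17,
                 17, 15, 23, 18, 12, 19, 12, 11, 19, 17, 14, 18, 15, 23,
                 30, 24, 16, 14, 22, 17, 17, 17, 17, 20, 19, 27, 17, 36])
x = np.arange(len(data))

# Model A: noise is the deviation from the sample mean
residual_a = data - data.mean()

# Model B: noise is the deviation from a least-squares trendline
slope, intercept = np.polyfit(x, data, deg=1)
residual_b = data - (slope * x + intercept)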

Obviously, I can choose the model with the smaller sigma^2, which is apparently (B). However, I am not confident that there really is a trend in the data, rather than the noise just randomly happening to look like one. So I ran a Dickey-Fuller test on both models, and both statistics are below the 1% critical value ('1%': -3.529, A: -5.282, B: -6.149), which tells me it is still possible that (A) is the right model.
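For reference, augmented Dickey-Fuller statistics like these can be obtained with statsmodels roughly as follows (a sketch; it assumes the residual_a and residual_b arrays from the snippet above):

from statsmodels.tsa.stattools import adfuller

for name, resid in (('A', residual_a), ('B', residual_b)):
    stat, pvalue, usedlag, nobs, crit_values, icbest = adfuller(resid)
    print(f"Model {name}: ADF statistic = {stat:.3f}, "
          f"1% critical value = {crit_values['1%']:.3f}")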

So I come to the question: what is the probability that (A) is the better model?

I tried to solve it like this: I assume the noise is normally distributed, so I fit a normal distribution to the residuals of (A) and (B) separately, which gives me two models for the noise. Then I drew n samples (the original sample length) from each of the two noise models and compared their sigma^2. If model (A)'s sigma^2 was smaller, I counted it in favour of model (A); otherwise against it. I repeated this test a reasonable number of times.

In Python code, it is probably clearer:

import numpy as np
from scipy import stats

# fit a normal distribution to each model's residuals
model_b_mu, model_b_sigma = stats.norm.fit(model_b['residual'])
model_a_mu, model_a_sigma = stats.norm.fit(model_a['residual'])

def compare_models(modela_mu, modela_sigma, modelb_mu, modelb_sigma, length):
    repeats = 20000

    modela_better = 0
    for _ in range(repeats):
        modela = np.random.normal(modela_mu, modela_sigma, size=length)
        modelb = np.random.normal(modelb_mu, modelb_sigma, size=length)

        # test which sigma^2 is smaller: compare the sums of squares
        sigma_a = np.sum(modela ** 2)
        sigma_b = np.sum(modelb ** 2)
        if sigma_a < sigma_b:
            modela_better += 1

    return modela_better / repeats

model_a_better = compare_models(model_a_mu, model_a_sigma,
                                model_b_mu, model_b_sigma, len(model_a))
print(model_a_better)

This gave me 0.3152, which I interpreted as: assuming the noise is normally distributed, there is a 31.52% probability that model (A) is the better model.

My question is: am I thinking about this the right way? If not, why not? And how should I solve the problem?

PS: I am not a statistician, more of a programmer, so it is quite possible that the whole solution above is wrong. Hence I am asking for confirmation.

This is a so-called model selection problem. There isn't a single right answer, although the most nearly correct way to go about it is via Bayesian inference. That is, to compute the posterior probability p(model | data) for each of the models under consideration (two or more). Note that the result of Bayesian inference is a probability distribution over models, not a single "this model is correct" selection; any subsequent result which depends on a model is to be averaged over the distribution over models. Note also that Bayesian inference requires a prior over the models, that is, it's required that you specify a probability for each model a priori, in the absence of data. This is a feature, not a bug.
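To make this concrete: with equal prior probabilities over the two models and Gaussian noise, a rough and popular shortcut is the BIC approximation to the marginal likelihood. The sketch below is just that approximation, not a full Bayesian treatment; the bic helper is my own, and residual_a / residual_b are the residual arrays from the question's setup:

import numpy as np

def bic(residual, n_params):
    # BIC for a Gaussian-noise model with the variance profiled out:
    # n * ln(RSS / n) + k * ln(n)
    n = len(residual)
    rss = np.sum(residual ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

bic_a = bic(residual_a, 1)   # Model A estimates one parameter (the mean)
bic_b = bic(residual_b, 2)   # Model B estimates two (slope and intercept)

# BIC approximates -2 * ln p(data | model); with equal priors the
# posterior probability of model A is then roughly
p_a = 1.0 / (1.0 + np.exp((bic_a - bic_b) / 2.0))
print(f"P(A | data) ~ {p_a:.3f}")

Whether to also count the noise variance as a parameter is a matter of convention, but it cancels here, since both models would gain the same +1.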

Glancing at the problem as stated, it would probably be straightforward to work out the posterior probability for the two models you mention, but first you'll need to get somewhat familiar with the conceptual framework. A web search for Bayesian model inference should turn up a lot of resources. Also, this question is more suitable for stats.stackexchange.com.
