Create CDF for Anderson Darling test for Octave forge Statistics package function

I am using Octave and I would like to use the anderson_darling_test from the Octave forge Statistics package to test whether two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be "normal". This reference distribution will be the known distribution mentioned in the help for the above function: 'If you are selecting from a known distribution, convert your values into CDF values for the distribution and use "uniform".'

My question therefore is: how would I convert my data values into CDF values for the reference distribution?

Some background information for the problem: I have a vector of raw data values from which I extract the cyclic component (this will be the reference distribution). I then wish to compare this cyclic component with the raw data itself to see if the raw data is essentially cyclic in nature. If the null hypothesis that the two are the same can be rejected, I will know that most of the movement in the raw data is not due to cyclic influences but to either trend or just noise.

If your data has a specific distribution, for instance beta(3,3), then

p = betacdf(x, 3, 3)

will be uniform on [0, 1], which follows from the definition of a CDF (the probability integral transform). If you want to transform it to a normal distribution, you can just call the inverse CDF function

x=norminv(p,0,1)

on the uniform p. Once transformed, use your favorite test. I'm not sure I understand your data, but you might consider using a Kolmogorov-Smirnov test instead, which is a nonparametric test of distributional equality.
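The two transformations above can be tried end to end. The following is a minimal sketch that uses only core Octave functions so it runs without loading the statistics package: it assumes that betainc(x, 3, 3) (the regularized incomplete beta function) can stand in for betacdf(x, 3, 3), and that sqrt(2) * erfinv(2 * p - 1) can stand in for norminv(p, 0, 1). The sample itself is simulated, not taken from the question.

```octave
% Sketch of the CDF transform and the inverse-CDF transform, core Octave only.
rand("state", 1);                 % fix the seed so the run is repeatable
n = 5000;

% The median of 5 independent uniforms is the 3rd order statistic,
% which follows a Beta(3,3) distribution (simulated stand-in data).
x = median(rand(5, n));

% CDF values of the known distribution; should look Uniform(0,1).
% Assumed equivalent to betacdf(x, 3, 3) from the statistics package.
p = betainc(x, 3, 3);

% Inverse normal CDF; should look N(0,1).
% Assumed equivalent to norminv(p, 0, 1) from the statistics package.
z = sqrt(2) * erfinv(2 * p - 1);

printf("mean(p) = %.3f, var(p) = %.3f (theory for uniform: 0.5, 1/12)\n", mean(p), var(p));
printf("mean(z) = %.3f, std(z) = %.3f (theory for normal: 0, 1)\n", mean(z), std(z));
```

This only works because the Beta(3,3) parameters are known in advance; it says nothing about what happens when the reference distribution is estimated from the same data.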

Your approach is misguided in multiple ways. Several points:

  • The Anderson-Darling test implemented in Octave forge is a one-sample test: it requires one vector of data and a reference distribution. The distribution should be known, not derived from data. While you quote the help file correctly about using a CDF and the "uniform" option for a distribution that is not built in, you are ignoring the next sentence of the same help file:

Do not use "uniform" if the distribution parameters are estimated from the data itself, as this sharply biases the A^2 statistic toward smaller values.

So, don't do it.

  • Even if you found or wrote a function implementing a proper two-sample Anderson-Darling or Kolmogorov-Smirnov test, you would still be left with a couple of problems:

    1. Your samples (the data and the cyclic part estimated from the data) are not independent, and these tests assume independence.

    2. Given your description, I assume there is some sort of time predictor involved. So even if the distributions did coincide, that would not mean they coincide at the same time points, because comparing distributions collapses over time.

    3. The distribution of cyclic trend + error would not be expected to be the same as the distribution of the cyclic trend alone. Suppose the trend is sin(t). Then it will never go above 1. Now add a normally distributed random error term with standard deviation 0.1 (small, so that the trend is dominant). Obviously you could get values well above 1.
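The help-file warning quoted in the first bullet can be checked with a small simulation. This is a rough sketch, not the package's implementation: normal_cdf and ad_stat are helper names made up here, the A^2 statistic is computed by hand from its textbook formula against the uniform distribution, and the data are simulated N(0,1) draws. It compares A^2 when the true parameters are used with A^2 when the mean and standard deviation are estimated from the same sample.

```octave
% Simulation sketch: A^2 against "uniform" with known vs. estimated parameters.
% normal_cdf and ad_stat are hypothetical helpers written for this example.
normal_cdf = @(x, mu, sigma) 0.5 * erfc(-(x - mu) ./ (sigma * sqrt(2)));

% A^2 = -n - (1/n) * sum_i (2i-1) * [log(u_(i)) + log(1 - u_(n+1-i))]
% for a row vector u of CDF values.
ad_stat = @(u) -numel(u) - mean((2 * (1:numel(u)) - 1) .* ...
                                (log(sort(u)) + log(1 - fliplr(sort(u)))));

randn("state", 2);                % fix the seed so the run is repeatable
n = 50;                           % sample size per replication
reps = 500;                       % number of simulated samples
A2_known = zeros(1, reps);
A2_est = zeros(1, reps);

for r = 1:reps
  x = randn(1, n);                                          % truly N(0,1)
  A2_known(r) = ad_stat(normal_cdf(x, 0, 1));               % known parameters
  A2_est(r) = ad_stat(normal_cdf(x, mean(x), std(x)));      % estimated from x
end

printf("mean A^2, known parameters:     %.3f\n", mean(A2_known));
printf("mean A^2, estimated parameters: %.3f\n", mean(A2_est));
```

With estimated parameters the fitted distribution hugs the sample, so the statistic comes out systematically smaller and critical values meant for a fully specified distribution no longer apply.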

We do not have enough information to figure out the proper thing to do, and it is not really a programming question anyway. Look up time series theory - separating cyclic components is a major topic there. But many reasonable analyses will probably be based on the residuals: (observed value - value predicted from the cyclic component). You will still have to be careful about autocorrelation and other complexities, but at least it will be a move in the right direction.
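To make the sin(t) example and the residual suggestion concrete, here is a small sketch on simulated data. It assumes the cyclic component is known exactly (sin(t)), which will not be true for a component estimated from real data, and the lag-1 autocorrelation shown is only one simple check among many.

```octave
% Sketch: cyclic trend plus noise, then a look at the residuals (core Octave only).
randn("state", 3);                % fix the seed so the run is repeatable
t = linspace(0, 20 * pi, 2000);   % 10 full cycles
cyc = sin(t);                     % stand-in for the extracted cyclic component
y = cyc + 0.1 * randn(size(t));   % stand-in for the raw data (trend + small noise)

% The trend never exceeds 1, but the noisy series does.
printf("max(cyc) = %.3f, max(y) = %.3f\n", max(cyc), max(y));
printf("share of y above 1: %.3f\n", mean(y > 1));

% Residuals: observed value minus value predicted from the cyclic component.
res = y - cyc;
r0 = res - mean(res);
lag1 = sum(r0(1:end-1) .* r0(2:end)) / sum(r0 .^ 2);   % lag-1 autocorrelation
printf("residual std = %.3f, lag-1 autocorrelation = %.3f\n", std(res), lag1);
```

In this idealized setup the residuals are independent noise, so the lag-1 autocorrelation is near zero; residuals from real data with an estimated cyclic component will often be autocorrelated, which is the complication mentioned above.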
