
Running multiple but separate hypothesis tests at the same time

I am using the built-in ztest function from statsmodels to run a single hypothesis test. However, I want to run many separate hypothesis tests, one per column across many different columns, to test, say, the difference between two means or two medians, and doing them one by one is cumbersome. Is there a faster and more efficient way (in memory and time) to run n of these tests? To be more specific, say we have a dataframe of n columns, and I want to test the difference between the mean or median return on certain trading days (or a sequence of them) for a given ticker and the overall mean of that ticker over some period, say 5 years of daily values. In the standard case, one would use:

from statsmodels.stats.weightstats import ztest

ztest_Score, p_value = ztest(df_alternative['symbol is here'], df_null, alternative='two-sided')

Here df_null is a scalar (say, the daily average return over the entire period), and df_alternative is a column within a larger dataframe of tickers that holds the mean or median of the selected sequence of trading days. How can this iterative procedure be done, in a single line of code if possible, so that it goes over each of the separate columns in my dataframe together with its associated mean or median value, and compares them to decide which hypotheses to reject?

Best regards

First, the one-sample hypothesis test is vectorized. Here I assume the value under the null is 0:

import numpy as np
from statsmodels.stats.weightstats import ztest

x = np.random.randn(100, 4)

ztest_Score, p_value = ztest(x, value=0 , alternative='two-sided')
ztest_Score, p_value
(array([1.69925429, 0.5359994 , 0.05777533, 0.78699997]),
 array([0.08927128, 0.59195896, 0.95392759, 0.43128188]))

[ztest(x[:, i], value=0 , alternative='two-sided') for i in range(x.shape[1])]
[(1.699254292717283, 0.0892712806133958),
 (0.5359994032597257, 0.5919589628688362),
 (0.057775326408478586, 0.953927592014832),
 (0.7869999680163862, 0.43128188488265284)]

Second, the two-sample test is vectorized with appropriate numpy broadcasting. The following compares each column of the first sample to the second sample y:

y = np.random.randn(100)
statistic, p_value = ztest(x, y, alternative='two-sided')
statistic, p_value
(array([1.36445473, 0.50622444, 0.15362677, 0.64741684]),
 array([0.17242449, 0.6126991 , 0.87790403, 0.5173622 ]))

[ztest(x[:, i], y, alternative='two-sided') for i in range(x.shape[1])]
[(1.364454734896, 0.17242449122265047),
 (0.5062244362943313, 0.6126991023616855),
 (0.15362676881725684, 0.8779040290306083),
 (0.6474168385742498, 0.5173622008385331)]

statistic, p_value = ztest(x, y[:, None], alternative='two-sided')
statistic, p_value
(array([1.36445473, 0.50622444, 0.15362677, 0.64741684]),
 array([0.17242449, 0.6126991 , 0.87790403, 0.5173622 ]))

On the case in the question:

The two-sample test should not be used with a single observation in one of the samples. The ztest needs to compute the variance of the samples in order to compute inferential statistics such as p-values. Specifically, the ztest (or ttest) needs to compute the standard error of the mean estimate of both samples, which depends on the sample sizes. If a sample has only a single observation, then the pooled variance is used, but the standard error of the mean will be very large.
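To make the "very large standard error" concrete, here is a small sketch on synthetic data. The reference value `m` and the random sample are made up for illustration. With the usual pooled-variance formula, a second sample of size one contributes nothing to the variance estimate, so the standard error becomes roughly s * sqrt(1 + 1/n) instead of s / sqrt(n), and the statistic should shrink by a factor of sqrt(n + 1) relative to the one-sample test.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
x = rng.standard_normal(100)   # synthetic sample, n = 100
m = 0.1                        # hypothetical reference mean

# One-sample test: m is treated as a known constant with no uncertainty.
z_one, p_one = ztest(x, value=m, alternative='two-sided')

# Two-sample test where the second "sample" is only that single number.
z_two, p_two = ztest(x, np.array([m]), alternative='two-sided')

print(z_one, p_one)
print(z_two, p_two)

# Ratio of the two statistics next to sqrt(n + 1) for comparison.
print(z_one / z_two, np.sqrt(len(x) + 1))
```

Note that the second argument is a one-element array here. Passing a bare Python scalar as the second sample, as in the question, is not expected to work, because ztest reads the number of observations from the first axis of the array.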

So the options are either to use the one-sample z-test, which assumes that the second "mean" has no uncertainty, or to use the two-sample test with the full data series as the second sample, which will compute the standard error of its mean from the sample.
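Applied to a dataframe of returns, the two options might look like the sketch below. Everything about the data is an assumption for illustration: the ticker names, the synthetic returns, and the rule that the days of interest are Mondays. For the one-sample case, each column is centered on its own full-period mean so that a single call with value=0 tests all columns at once.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)

# Hypothetical data: about 5 years of daily returns for four made-up tickers.
dates = pd.bdate_range("2019-01-01", periods=1250)
returns = pd.DataFrame(
    rng.normal(0.0, 0.01, size=(len(dates), 4)),
    index=dates,
    columns=["AAA", "BBB", "CCC", "DDD"],
)

# Hypothetical selection rule: the trading days of interest are Mondays.
selected = returns[returns.index.dayofweek == 0]

# Option 1: one-sample test, treating each full-period mean as known.
# Subtracting the column means aligns on column labels.
centered = selected - returns.mean()
z_one, p_one = ztest(centered.to_numpy(), value=0, alternative='two-sided')

# Option 2: two-sample test against the full series, column by column,
# so the standard error of the full-period mean is estimated from the data.
z_two, p_two = ztest(selected.to_numpy(), returns.to_numpy(), alternative='two-sided')

results = pd.DataFrame(
    {"z_one": z_one, "p_one": p_one, "z_two": z_two, "p_two": p_two},
    index=returns.columns,
)
print(results)
```

Each ztest call is a single line and returns one statistic and one p-value per column. In this sketch the selected days are also part of the full series, so the two samples in option 2 overlap; comparing against the remaining days instead would be one way to avoid that. Missing values are assumed to have been dropped or filled beforehand.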
