How does scipy.stats handle NaNs?
I am trying to do some statistics in Python. I have data with several missing values, filled with np.nan, and I am not sure whether I should remove them manually or whether scipy can handle them. So I tried both:
import numpy as np
import scipy.stats

a = [0.75, np.nan, 0.58337, 0.75, 0.75, 0.91663, 1.0, np.nan, 0.663, 0.837, 0.837, 1.0, 0.663, 1.0, 1.0, 0.91663, 0.75, 0.41669, 0.58337, 0.663, 0.75, 0.58337]
b = [0.837, np.nan, 0.663, 0.58337, 0.75, 0.75, 0.58337, np.nan, 0.166, 0.5, 0.663, 1.0, 0.91663, 1.0, 0.663, 0.75, 0.75, 0.41669, 0.331, 0.25, 1.0, 0.91663]

# Attempt 1: pass the data to wilcoxon as-is, NaNs included
d_1, d_2 = a, b
wilc1 = scipy.stats.wilcoxon(d_1, d_2, zero_method='pratt')

# Attempt 2: drop every pair in which either value is NaN, then test
d_1, d_2 = [], []
for d1, d2 in zip(a, b):
    if np.isnan(d1) or np.isnan(d2):
        continue
    d_1.append(d1)
    d_2.append(d2)
wilc2 = scipy.stats.wilcoxon(d_1, d_2, zero_method='pratt')

print(wilc1)
print(wilc2)
I get two runtime warnings:
C:\Python27\lib\site-packages\scipy\stats\morestats.py:1963: RuntimeWarning: invalid value encountered in greater
r_plus = sum((d > 0) * r, axis=0
and two Wilcoxon outputs:
(54.0, 0.018545881687477818)
(54.0, 0.056806600853965265)
As you can see, I get the same test statistic (W) in both cases but two different p-values. Which one is correct?
My guess is that Wilcoxon handles missing values correctly when calculating the test statistic, but when calculating the p-value it uses the len() of all the data, not just the valid cases. Does this count as a bug?
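One way to probe this guess is a rough sketch of the textbook normal approximation for the Wilcoxon signed-rank statistic, evaluated once with n = 22 (all pairs) and once with n = 20 (only the pairs without NaN). This is an assumption about the mechanism, not a copy of scipy's code: it ignores tie and zero corrections, and the helper name approx_p_value is made up for illustration.

```python
import math

def approx_p_value(w, n):
    """Two-sided p-value from the plain normal approximation (no tie/zero correction)."""
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / sd
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))

w = 54.0  # test statistic reported in both runs
p_all_pairs = approx_p_value(w, 22)    # n also counts the two NaN pairs
p_valid_pairs = approx_p_value(w, 20)  # n counts only the pairs without NaN
print(p_all_pairs, p_valid_pairs)
```

With the same W, the larger n gives the smaller p-value, and the two results land near the two p-values printed above (about 0.019 and 0.057). That is consistent with the sample size in the first call including the NaN pairs.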
You cannot mathematically compute a test statistic on NaN. Unless you find proof or documentation of special treatment of NaN, you cannot rely on it.
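Since NaN handling cannot be taken for granted, the safe route is to drop incomplete pairs explicitly before testing. A minimal sketch with a boolean mask, using a shortened, made-up sample rather than the full data from the question:

```python
import numpy as np

# Shortened, illustrative paired samples (not the question's full data)
a = np.array([0.75, np.nan, 0.58337, 0.75, 1.0])
b = np.array([0.837, np.nan, 0.663, np.nan, 0.5])

# Keep a pair only if both of its values are present
valid = ~(np.isnan(a) | np.isnan(b))
a_clean, b_clean = a[valid], b[valid]

print(a_clean)
print(b_clean)
```

The cleaned arrays can then be passed to scipy.stats.wilcoxon, as in the second attempt in the question.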
In my experience, even numpy generally does not treat NaN specially, for example in median. Instead, the results are whatever they happen to be, as a consequence of how the algorithm is implemented.
For example, numpy.median() seems to end up treating NaN as inf, placing NaN above the median. This is likely just a side effect of a<b comparisons always being false for NaN. A similar effect is probably behind your two identical test statistic values W.
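The comparison behaviour is easy to see in isolation. The sketch below uses made-up values; the array d only mimics the kind of element-wise test (d > 0) that appears in the warning message, and is not taken from scipy's source.

```python
import numpy as np

nan = float("nan")

# Every comparison that involves NaN is False, including equality with itself
scalar_results = (nan < 1.0, nan > 1.0, nan == nan)
print(scalar_results)

# Element-wise: a NaN difference counts as neither positive nor negative
d = np.array([0.5, np.nan, -0.2])
with np.errstate(invalid="ignore"):  # silences the RuntimeWarning where NumPy emits one
    is_positive = d > 0
    is_negative = d < 0
print(is_positive, is_negative)

# np.sort places NaN at the end, as if it were larger than every number
sorted_vals = np.sort(np.array([3.0, np.nan, 1.0]))
print(sorted_vals)
```

An entry that is neither positive nor negative would add nothing to a sum of positive ranks or a sum of negative ranks, which would fit W coming out the same with and without the NaN pairs.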
Also note: numpy has a few NaN-aware method variants, such as numpy.nanmean: http://docs.scipy.org/doc/numpy/reference/generated/numpy.nanmean.html
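A small illustration of the difference between the plain and the NaN-aware variant, on made-up values:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

plain_mean = np.mean(values)         # NaN propagates into the result
nan_aware_mean = np.nanmean(values)  # NaN entries are skipped

print(plain_mean, nan_aware_mean)
```

Here np.mean returns nan, while np.nanmean averages only the two present values.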