
T-Test in Scipy with NaN values

I have a problem running a t-test in SciPy that is slowly driving me crazy. It should be simple to resolve, but nothing I try works, and I can't find a solution despite extensive searching. I'm using Spyder on the latest Anaconda distribution.

Specifically: I want to compare the means of two columns, 'Trait_A' and 'Trait_B', in a pandas DataFrame that I've imported from a CSV file. Some of the values in one of the columns are 'NaN' ('Not a Number'). By default, SciPy's independent-samples t-test function doesn't accommodate 'NaN' values, but setting the 'nan_policy' parameter to 'omit' should deal with this. Nevertheless, when I do that, the test statistic and p-value both come back as 'NaN'. When I restrict the range of values to actual numbers, the test works fine. My data and code are below; can anyone suggest what I'm doing wrong? Thanks!

Data:

     Trait_A   Trait_B
0   1.714286  0.000000
1   4.275862  4.000000
2   0.500000  4.625000
3   1.000000  0.000000
4   1.000000  4.000000
5   1.142857  1.000000
6   2.000000  1.000000
7   9.416667  1.956522
8   2.052632  0.571429
9   2.100000  0.166667
10  0.666667  0.000000
11  2.333333  1.705882
12  2.768145       NaN
13  0.000000       NaN
14  6.333333       NaN
15  0.928571       NaN

My code:

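import pandas as pd
import scipy.stats as sp  # note: sp is scipy.stats here, not scipy itself

data = pd.read_csv("filepath/Data2.csv")
# sp already refers to scipy.stats, so call ttest_ind on it directly
print(sp.ttest_ind(data['Trait_A'], data['Trait_B'], nan_policy='omit'))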

My result:

Ttest_indResult(statistic=nan, pvalue=nan)

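It looks like a bug. As a workaround, you can drop the NaNs before passing the columns to the t-test. Note that data.dropna() removes every row that has a NaN in either column, so this runs the test on rows 0 to 11 only:

sp.ttest_ind(data.dropna()['Trait_A'], data.dropna()['Trait_B'])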
Ttest_indResult(statistic=0.88752464718609214, pvalue=0.38439692093551037)
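
If the four extra Trait_A values should stay in the comparison, one option is to drop NaNs from each column separately instead of row-wise. The following is only a sketch: it rebuilds the table from the question inline rather than reading the CSV, and the variable names trait_a and trait_b are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Data copied from the question; Trait_B is missing in the last four rows.
data = pd.DataFrame({
    "Trait_A": [1.714286, 4.275862, 0.5, 1.0, 1.0, 1.142857, 2.0, 9.416667,
                2.052632, 2.1, 0.666667, 2.333333, 2.768145, 0.0, 6.333333,
                0.928571],
    "Trait_B": [0.0, 4.0, 4.625, 0.0, 4.0, 1.0, 1.0, 1.956522,
                0.571429, 0.166667, 0.0, 1.705882] + [np.nan] * 4,
})

# Drop NaNs column by column, so a NaN in Trait_B does not discard
# the Trait_A value that happens to sit in the same row.
trait_a = data["Trait_A"].dropna()
trait_b = data["Trait_B"].dropna()

# The samples are independent, so they may have different lengths.
result = stats.ttest_ind(trait_a, trait_b)
print(len(trait_a), len(trait_b))
print(result)
```

Because the two samples now have different sizes (16 and 12), the statistic will differ from the row-wise result above; which of the two is appropriate depends on whether the rows are meant to be treated as matched.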

The bug is on line 3885 of scipy/scipy/stats/stats.py:

# check both a and b
contains_nan, nan_policy = (_contains_nan(a, nan_policy) or
                            _contains_nan(b, nan_policy))

It should be:

contains_nan             = (_contains_nan(a, nan_policy)[0] or
                            _contains_nan(b, nan_policy)[0])

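_contains_nan returns a tuple, and a non-empty tuple is always truthy, so the `or` never evaluates the check on b. In your case only 'Trait_B' contains NaNs, so swapping 'Trait_A' and 'Trait_B' in the call also solves your problem.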
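
The short-circuit behaviour behind this can be reproduced in plain Python without SciPy. In the sketch below, contains_nan is a hypothetical stand-in for SciPy's private _contains_nan helper; the only thing assumed about the real helper is that it returns a (flag, policy) tuple, as the unpacking in the snippet above implies.

```python
def contains_nan(values, nan_policy="propagate"):
    """Hypothetical stand-in: return (has_nan, nan_policy) as a tuple."""
    has_nan = any(v != v for v in values)  # NaN is the only float unequal to itself
    return has_nan, nan_policy


a = [1.0, 2.0, 3.0]           # no NaN, like Trait_A
b = [1.0, float("nan")]       # contains NaN, like Trait_B

# Buggy form: `or` is applied to the tuples. A non-empty tuple is truthy,
# so the right-hand side is never evaluated and b is never inspected.
flag_buggy, _ = contains_nan(a, "omit") or contains_nan(b, "omit")

# Fixed form: `or` is applied to the boolean flags themselves.
flag_fixed = contains_nan(a, "omit")[0] or contains_nan(b, "omit")[0]

# Swapping the arguments puts the NaN column first, which is why it
# works around the buggy form.
flag_swapped, _ = contains_nan(b, "omit") or contains_nan(a, "omit")

print(flag_buggy, flag_fixed, flag_swapped)
```

With the buggy form the NaN in b goes unnoticed, so the 'omit' handling is never triggered and the NaN propagates into the statistic.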
