
Pandas t-test using row as the arrays

I need to calculate a p-value for two sets of data by comparing each row in one DataFrame with the corresponding row in another DataFrame. For example, array1 would be the five values in row 300 of df1 (excluding stdev and Ctrl average), and array2 would be the five values in row 300 of df2 (excluding stdev).

df1:

       Pep Ctrl 1  Pep Ctrl 2   Pep Ctrl 3  Pep Ctrl 4  Pep Ctrl 5         stdev  Ctrl average
300    47591000.0         NaN   49576000.0  41288000.0  61727000.0  8.551730e+06  4.174675e+07
301     4305900.0   2670800.0          NaN         NaN   7338400.0  2.368407e+06  4.170877e+06
302    11466000.0   3799400.0          NaN  18552000.0  31661000.0  1.184124e+07  1.546393e+07
303    11255000.0   5402300.0   18337000.0  19706000.0  40286000.0  1.321849e+07  1.803413e+07

df2:

      MCI 1 vs Ctrl normalized  MCI 2 vs Ctrl normalized  MCI 3 vs Ctrl normalized  MCI 4 vs Ctrl normalized  MCI 5 vs Ctrl normalized         stdev
300               1.054045e+08              4.980206e+07              4.764870e+07              1.834201e+07              2.994124e+07  3.346473e+07
301               1.019931e+07              3.309509e+06              6.595145e+06              1.089385e+07                       NaN  3.508776e+06
302               3.288333e+07              6.953062e+06              1.430190e+07              4.988915e+06              2.310888e+07  1.162495e+07
303               3.332308e+07              1.682790e+07              2.951138e+07              9.474570e+06              2.965893e+07  1.014219e+07

I need to run a two-tailed t-test assuming equal variances on each pair of rows, and then add the resulting p-value as the last column. Alternatively, if SciPy has a function that takes just the number of items, the standard deviation, and the average, that would also work.

This is what I tried:

group1 = [df1['Pep Ctrl 1'],df1['Pep Ctrl 2'],df1['Pep Ctrl 3'],df1['Pep Ctrl 4'],df1['Pep Ctrl 5']]
group2 = [df2['MCI 1 vs Ctrl normalized'], df2['MCI 2 vs Ctrl normalized'], df2['MCI 3 vs Ctrl normalized'], df2['MCI 4 vs Ctrl normalized'], df2['MCI 5 vs Ctrl normalized']]
ttest = stats.ttest_ind(a=group1,b=group2,axis = 1, equal_var = True)

Any help would be appreciated.

df1 constructor:

{'Pep Ctrl 1': [47591000.0, 4305900.0, 11466000.0, 11255000.0],
 'Pep Ctrl 2': [nan, 2670800.0, 3799400.0, 5402300.0],
 'Pep Ctrl 3': [49576000.0, nan, nan, 18337000.0],
 'Pep Ctrl 4': [41288000.0, nan, 18552000.0, 19706000.0],
 'Pep Ctrl 5': [61727000.0, 7338400.0, 31661000.0, 40286000.0],
 'stdev': [8551730.0, 2368407.0, 11841240.0, 13218490.0],
 'Ctrl average': [41746750.0, 4170877.0, 15463930.0, 18034130.0]}

df2 constructor:

{'MCI 1 vs Ctrl normalized': [105404500.0, 10199310.0, 32883330.0, 33323080.0],
 'MCI 2 vs Ctrl normalized': [49802060.0, 3309509.0, 6953062.0, 16827900.0],
 'MCI 3 vs Ctrl normalized': [47648700.0, 6595145.0, 14301900.0, 29511380.0],
 'MCI 4 vs Ctrl normalized': [18342010.0, 10893850.0, 4988915.0, 9474570.0],
 'MCI 5 vs Ctrl normalized': [29941240.0, nan, 23108880.0, 29658930.0],
 'stdev': [33464730.0, 3508776.0, 11624950.0, 10142190.0]}

You could use iterrows to iterate over df1 and compare each row with the row in df2 that has the same index label. Passing nan_policy='omit' makes ttest_ind skip the missing values instead of returning NaN:

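from scipy import stats

# Measurement columns of df2 (everything except the summary column)
df2_cols = df2.columns.drop('stdev')
# One equal-variance, two-sided t-test per row; NaN values are ignored
out = [stats.ttest_ind(df2.loc[i, df2_cols], row, equal_var=True, nan_policy='omit')
       for i, row in df1.drop(columns=['stdev', 'Ctrl average']).iterrows()]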

Output:

[Ttest_indResult(statistic=0.010483243999151896, pvalue=0.9919282503324176), 
 Ttest_indResult(statistic=1.2563264347346306, pvalue=0.26449954642964396), 
 Ttest_indResult(statistic=0.009874028613226149, pvalue=0.9923973079846519), 
 Ttest_indResult(statistic=0.6390907092148139, pvalue=0.5406265164807074)]
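
The list above holds one result per row, but the question asks for the p-value as a last column. A minimal sketch of one way to do that is below, assuming both frames share the same index labels. It passes the two blocks of measurements to ttest_ind as 2-D arrays with axis=1 instead of looping; the names ctrl, mci and the column label 'p-value' are illustrative choices, not part of the original post.

```python
import numpy as np
import pandas as pd
from scipy import stats

nan = np.nan
idx = [300, 301, 302, 303]
df1 = pd.DataFrame({
    'Pep Ctrl 1': [47591000.0, 4305900.0, 11466000.0, 11255000.0],
    'Pep Ctrl 2': [nan, 2670800.0, 3799400.0, 5402300.0],
    'Pep Ctrl 3': [49576000.0, nan, nan, 18337000.0],
    'Pep Ctrl 4': [41288000.0, nan, 18552000.0, 19706000.0],
    'Pep Ctrl 5': [61727000.0, 7338400.0, 31661000.0, 40286000.0],
    'stdev': [8551730.0, 2368407.0, 11841240.0, 13218490.0],
    'Ctrl average': [41746750.0, 4170877.0, 15463930.0, 18034130.0]}, index=idx)
df2 = pd.DataFrame({
    'MCI 1 vs Ctrl normalized': [105404500.0, 10199310.0, 32883330.0, 33323080.0],
    'MCI 2 vs Ctrl normalized': [49802060.0, 3309509.0, 6953062.0, 16827900.0],
    'MCI 3 vs Ctrl normalized': [47648700.0, 6595145.0, 14301900.0, 29511380.0],
    'MCI 4 vs Ctrl normalized': [18342010.0, 10893850.0, 4988915.0, 9474570.0],
    'MCI 5 vs Ctrl normalized': [29941240.0, nan, 23108880.0, 29658930.0],
    'stdev': [33464730.0, 3508776.0, 11624950.0, 10142190.0]}, index=idx)

# Keep only the raw measurements; summary columns must not enter the test
ctrl = df1.drop(columns=['stdev', 'Ctrl average'])
mci = df2.drop(columns=['stdev']).loc[ctrl.index]  # same row order as ctrl

# axis=1 treats every row as one sample; NaN values are dropped per row
result = stats.ttest_ind(mci.to_numpy(), ctrl.to_numpy(), axis=1,
                         equal_var=True, nan_policy='omit')

# Attach the p-values as the last column, aligned on the index labels
df2['p-value'] = pd.Series(np.asarray(result.pvalue, dtype=float),
                           index=ctrl.index)
print(df2['p-value'])
```

The p-values should agree with the ones in the output above, since the same test is applied to the same rows. The same Series could be assigned to df1 instead if that is where the column is wanted.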

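On the alternative raised in the question: SciPy does offer stats.ttest_ind_from_stats, which takes the mean, standard deviation and number of observations of each group rather than the raw values. The sketch below recomputes those statistics from the measurement columns instead of reusing the stored stdev and Ctrl average columns, because the stored Ctrl average for row 300 does not appear to equal the mean of its four non-missing values. That recomputation, the per-row count via count(axis=1), and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

nan = np.nan
idx = [300, 301, 302, 303]
df1 = pd.DataFrame({
    'Pep Ctrl 1': [47591000.0, 4305900.0, 11466000.0, 11255000.0],
    'Pep Ctrl 2': [nan, 2670800.0, 3799400.0, 5402300.0],
    'Pep Ctrl 3': [49576000.0, nan, nan, 18337000.0],
    'Pep Ctrl 4': [41288000.0, nan, 18552000.0, 19706000.0],
    'Pep Ctrl 5': [61727000.0, 7338400.0, 31661000.0, 40286000.0]}, index=idx)
df2 = pd.DataFrame({
    'MCI 1 vs Ctrl normalized': [105404500.0, 10199310.0, 32883330.0, 33323080.0],
    'MCI 2 vs Ctrl normalized': [49802060.0, 3309509.0, 6953062.0, 16827900.0],
    'MCI 3 vs Ctrl normalized': [47648700.0, 6595145.0, 14301900.0, 29511380.0],
    'MCI 4 vs Ctrl normalized': [18342010.0, 10893850.0, 4988915.0, 9474570.0],
    'MCI 5 vs Ctrl normalized': [29941240.0, nan, 23108880.0, 29658930.0]}, index=idx)

def row_stats(frame):
    """Per-row mean, sample standard deviation (ddof=1) and non-NaN count."""
    return (frame.mean(axis=1).to_numpy(dtype=float),
            frame.std(axis=1, ddof=1).to_numpy(dtype=float),
            frame.count(axis=1).to_numpy(dtype=float))

mean_mci, std_mci, n_mci = row_stats(df2)
mean_ctrl, std_ctrl, n_ctrl = row_stats(df1)

# Equal-variance (Student) t-test from summary statistics, one test per row
result = stats.ttest_ind_from_stats(mean_mci, std_mci, n_mci,
                                    mean_ctrl, std_ctrl, n_ctrl,
                                    equal_var=True)
pvalues = pd.Series(np.asarray(result.pvalue, dtype=float),
                    index=df1.index, name='p-value')
print(pvalues)
```

Because the standard deviations use ddof=1 and the counts exclude NaN, this should reproduce the p-values obtained from the raw rows with nan_policy='omit'.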