Pandas t-test using row as the arrays
I need to calculate a p-value for two sets of data, comparing each row in one DataFrame with the corresponding row in another DataFrame. For example, array1 would be the five items in row 300 of df1 (not including stdev and Ctrl average), and array2 would be the five items in row 300 of df2 (not including stdev).
df1:
     Pep Ctrl 1  Pep Ctrl 2  Pep Ctrl 3  Pep Ctrl 4  Pep Ctrl 5         stdev  Ctrl average
300  47591000.0         NaN  49576000.0  41288000.0  61727000.0  8.551730e+06  4.174675e+07
301   4305900.0   2670800.0         NaN         NaN   7338400.0  2.368407e+06  4.170877e+06
302  11466000.0   3799400.0         NaN  18552000.0  31661000.0  1.184124e+07  1.546393e+07
303  11255000.0   5402300.0  18337000.0  19706000.0  40286000.0  1.321849e+07  1.803413e+07
df2:
     MCI 1 vs Ctrl normalized  MCI 2 vs Ctrl normalized  MCI 3 vs Ctrl normalized  MCI 4 vs Ctrl normalized  MCI 5 vs Ctrl normalized         stdev
300              1.054045e+08              4.980206e+07              4.764870e+07              1.834201e+07              2.994124e+07  3.346473e+07
301              1.019931e+07              3.309509e+06              6.595145e+06              1.089385e+07                       NaN  3.508776e+06
302              3.288333e+07              6.953062e+06              1.430190e+07              4.988915e+06              2.310888e+07  1.162495e+07
303              3.332308e+07              1.682790e+07              2.951138e+07              9.474570e+06              2.965893e+07  1.014219e+07
I need to do a two-tailed t-test with equal variances and then add the result as the last column. Alternatively, if SciPy has an option to take just the number of items, the standard deviation, and the average as input, that could also work.
This is what I tried:
group1 = [df1['Pep Ctrl 1'],df1['Pep Ctrl 2'],df1['Pep Ctrl 3'],df1['Pep Ctrl 4'],df1['Pep Ctrl 5']]
group2 = [df2['MCI 1 vs Ctrl normalized'], df2['MCI 2 vs Ctrl normalized'], df2['MCI 3 vs Ctrl normalized'], df2['MCI 4 vs Ctrl normalized'], df2['MCI 5 vs Ctrl normalized']]
ttest = stats.ttest_ind(a=group1,b=group2,axis = 1, equal_var = True)
Any help would be appreciated.
df1 constructor:
{'Pep Ctrl 1': [47591000.0, 4305900.0, 11466000.0, 11255000.0],
'Pep Ctrl 2': [nan, 2670800.0, 3799400.0, 5402300.0],
'Pep Ctrl 3': [49576000.0, nan, nan, 18337000.0],
'Pep Ctrl 4': [41288000.0, nan, 18552000.0, 19706000.0],
'Pep Ctrl 5': [61727000.0, 7338400.0, 31661000.0, 40286000.0],
'stdev': [8551730.0, 2368407.0, 11841240.0, 13218490.0],
'Ctrl average': [41746750.0, 4170877.0, 15463930.0, 18034130.0]}
df2 constructor:
{'MCI 1 vs Ctrl normalized': [105404500.0, 10199310.0, 32883330.0, 33323080.0],
'MCI 2 vs Ctrl normalized': [49802060.0, 3309509.0, 6953062.0, 16827900.0],
'MCI 3 vs Ctrl normalized': [47648700.0, 6595145.0, 14301900.0, 29511380.0],
'MCI 4 vs Ctrl normalized': [18342010.0, 10893850.0, 4988915.0, 9474570.0],
'MCI 5 vs Ctrl normalized': [29941240.0, nan, 23108880.0, 29658930.0],
'stdev': [33464730.0, 3508776.0, 11624950.0, 10142190.0]}
You could use iterrows to iterate over df1 and compare each row with the row in df2 that has the same index:
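from scipy import stats

# Keep only the sample columns of df2 (drop the summary column)
df2_cols = df2.columns.drop('stdev')
# Pair each df1 row with the df2 row sharing its index; nan_policy='omit' ignores NaNs
out = [stats.ttest_ind(df2.loc[i, df2_cols], row, equal_var=True, nan_policy='omit')
       for i, row in df1.drop(columns=['stdev','Ctrl average']).iterrows()]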
Output:
[Ttest_indResult(statistic=0.010483243999151896, pvalue=0.9919282503324176),
Ttest_indResult(statistic=1.2563264347346306, pvalue=0.26449954642964396),
Ttest_indResult(statistic=0.009874028613226149, pvalue=0.9923973079846519),
Ttest_indResult(statistic=0.6390907092148139, pvalue=0.5406265164807074)]
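The question also asks for the p-value as a last column. One way to get there without a Python-level loop is to pass both blocks of sample columns to ttest_ind as 2-D arrays with axis=1, so that each row is tested against its counterpart. The following is a minimal sketch under that assumption, rebuilding only the sample columns from the question's constructors; the names ctrl, mci, result, and p_value are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd
from scipy import stats

nan = np.nan
idx = [300, 301, 302, 303]

# Sample columns copied from the question (summary columns left out).
ctrl = pd.DataFrame({
    'Pep Ctrl 1': [47591000.0, 4305900.0, 11466000.0, 11255000.0],
    'Pep Ctrl 2': [nan, 2670800.0, 3799400.0, 5402300.0],
    'Pep Ctrl 3': [49576000.0, nan, nan, 18337000.0],
    'Pep Ctrl 4': [41288000.0, nan, 18552000.0, 19706000.0],
    'Pep Ctrl 5': [61727000.0, 7338400.0, 31661000.0, 40286000.0],
}, index=idx)
mci = pd.DataFrame({
    'MCI 1 vs Ctrl normalized': [105404500.0, 10199310.0, 32883330.0, 33323080.0],
    'MCI 2 vs Ctrl normalized': [49802060.0, 3309509.0, 6953062.0, 16827900.0],
    'MCI 3 vs Ctrl normalized': [47648700.0, 6595145.0, 14301900.0, 29511380.0],
    'MCI 4 vs Ctrl normalized': [18342010.0, 10893850.0, 4988915.0, 9474570.0],
    'MCI 5 vs Ctrl normalized': [29941240.0, nan, 23108880.0, 29658930.0],
}, index=idx)

# axis=1 runs one test per row; NaNs are dropped per row via nan_policy='omit'.
res = stats.ttest_ind(mci.to_numpy(), ctrl.to_numpy(),
                      axis=1, equal_var=True, nan_policy='omit')

# Attach the two-sided p-values as the last column (hypothetical column name).
result = ctrl.copy()
result['p_value'] = np.asarray(res.pvalue, dtype=float)
print(result['p_value'])
```

This relies on both frames having their rows in the same order; if the indexes might differ, aligning them first (for example with mci.reindex(ctrl.index)) would be a sensible precaution.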
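On the alternative raised in the question (feeding in only the count, standard deviation, and mean): SciPy provides stats.ttest_ind_from_stats for this. The sketch below uses row 300 from the question and computes the summary values with pandas, which skips NaN by default and uses the sample standard deviation (ddof=1). The variable names are illustrative. Note the assumption that each count is the number of non-NaN values in the row, not the number of columns.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Row 300 from the question: control samples and MCI samples.
ctrl_row = pd.Series([47591000.0, np.nan, 49576000.0, 41288000.0, 61727000.0])
mci_row = pd.Series([105404500.0, 49802060.0, 47648700.0, 18342010.0, 29941240.0])

# Summary statistics only: mean, sample standard deviation, and non-NaN count.
summary = stats.ttest_ind_from_stats(
    mean1=mci_row.mean(), std1=mci_row.std(), nobs1=mci_row.count(),
    mean2=ctrl_row.mean(), std2=ctrl_row.std(), nobs2=ctrl_row.count(),
    equal_var=True,
)

# Same test on the raw values, for comparison.
direct = stats.ttest_ind(mci_row.dropna(), ctrl_row.dropna(), equal_var=True)
print(summary.pvalue, direct.pvalue)
```

For a whole DataFrame, the same function accepts arrays, so row-wise mean(axis=1), std(axis=1), and count(axis=1) could be passed in directly.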