
How to calculate t and p values with a t-test approach in Python?

I have a dataset that measures the expression levels of a large number of genes simultaneously.

Here is part of my data frame:

[image: excerpt of the data frame]

Column 0 holds the gene types, and the other columns are patient samples; each sample in the dataset represents one patient. For each patient, 7070 gene expression values are measured in order to classify the patient's disease into one of the following classes: EPD, JPA, MED, MGL, RHB.

I would like to generate subsets containing the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 genes with the highest absolute t-value for each class.

I tried to use scipy.stats.ttest_ind for every possible pair.

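    import numpy as np
    from scipy.stats import ttest_ind

    # `train` (samples x genes) and `classes` (one integer label per sample)
    # are defined elsewhere.
    def calculate_t():
        t_res = []
        for cls in range(np.max(classes)):
            # Row indices of the samples that belong to this class
            samp = np.where(classes == cls)[0]
            for gene in range(train.shape[1]):
                for other_genes in range(gene, train.shape[1]):
                    # t-test between two genes within the same class
                    t_res.append(ttest_ind(train[samp, gene], train[samp, other_genes])[:])
        return t_res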

I didn't continue because I thought it would take too long.

I would appreciate it if anyone has any ideas. Have a nice day.

I'll try not to get too stats-heavy in my answer, since Stack Overflow is meant to focus on technical issues, but there are some fairly large theoretical problems with carrying out multiple tests. In short, a p-value below 0.05 is normally required to reject the null hypothesis, meaning that a result at least this extreme would occur only 5% of the time if the null hypothesis were true. If you carry out lots of similar tests, the chance that at least one of them rejects the null hypothesis purely by chance becomes much higher.

Think of it as rolling a die to get a six: there's only a one-in-six chance on each roll, but if you roll a hundred times, it's more or less guaranteed that several of your rolls will be sixes (even though a six is unlikely on any given throw).
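To put a rough number on this: if m independent tests are each run at a significance level of 0.05 and every null hypothesis is true, the probability of at least one false rejection is 1 - (1 - 0.05)^m. The sketch below just evaluates that formula. The function name is made up for illustration, and independence between the tests is an assumption that rarely holds exactly for gene expression data.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Show how quickly the error rate grows with the number of tests
for m in (1, 10, 100, 7070):
    print(m, round(family_wise_error_rate(m), 4))
```

With only ten tests the chance of at least one spurious "significant" result is already about 40%, which is why running thousands of uncorrected t-tests is risky.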

Rather than optimising your code to carry out multiple t-tests, it might be worth looking at alternative significance tests that are designed to work across multiple comparisons.

SciPy has an ANOVA test you can use for significance across multiple comparisons, like so:

    from scipy import stats
    stats.f_oneway(df['sample_one'], df['sample_two'], df['sample_three'])
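For a layout closer to the one in the question, f_oneway can also take 2-D arrays and test every column at once. The following is only a sketch on synthetic data: the matrix `expr`, the `labels` array, the class coding 0/1/2, and the simulated shift in gene 0 are all assumptions for illustration, not the asker's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_class, n_genes = 20, 5

# Hypothetical class label for each sample (three classes, coded 0, 1, 2)
labels = np.repeat([0, 1, 2], n_per_class)
# Hypothetical expression matrix: rows are samples, columns are genes
expr = rng.normal(0.0, 1.0, (labels.size, n_genes))
# Make gene 0 behave differently in class 2 so there is something to detect
expr[labels == 2, 0] += 3.0

# One 2-D array per class; f_oneway works along axis 0 by default,
# so it returns one F statistic and one p-value per gene column
groups = [expr[labels == c] for c in np.unique(labels)]
f_stat, p_value = stats.f_oneway(*groups)

print(f_stat)
print(p_value)
```

Ranking genes by `f_stat` then gives a single ordering across all classes without any explicit Python loop over genes.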

However, this will only give you the F and p-values of the overall test. If you want to break it down in more detail, it's probably worth looking into other tests, such as Tukey's test, which is supported by the statsmodels module. You can find a helpful guide on carrying it out here.
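As a rough illustration of such a pairwise follow-up: statsmodels provides pairwise_tukeyhsd in statsmodels.stats.multicomp, but the sketch below uses scipy.stats.tukey_hsd instead (available in SciPy 1.8 and later) so that it needs only one library. The three groups are synthetic stand-ins for one gene's expression in three patient classes, and their names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical expression values of a single gene in three patient classes
class_a = rng.normal(0.0, 1.0, 20)
class_b = rng.normal(0.0, 1.0, 20)
class_c = rng.normal(3.0, 1.0, 20)  # simulated to differ from the other two

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate
result = stats.tukey_hsd(class_a, class_b, class_c)

# result.pvalue[i, j] is the adjusted p-value for group i versus group j
print(result.pvalue)
```

Unlike the overall ANOVA, the p-value matrix shows which specific pairs of classes differ.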
