
How to calculate t and p values with t-test approach in python?

I have a dataset that measures the expression levels of large numbers of genes simultaneously.

Here is some part of my data frame

[screenshot of the data frame omitted]

Column 0 refers to gene types and the other columns are patient samples. For each patient, 7070 gene expression values are measured in order to classify the patient's disease into one of the following classes: EPD, JPA, MED, MGL, RHB.

I would like to generate subsets containing the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 genes with the highest absolute t-value for each class.

I tried to use scipy.stats.ttest_ind for every possible pair:

import numpy as np
from scipy.stats import ttest_ind

def calculate_t():
    t_res = []
    for cls in range(np.max(classes) + 1):  # range(np.max(classes)) would skip the last class
        samp = np.where(classes == cls)[0]
        for gene in range(train.shape[1]):
            for other_gene in range(gene, train.shape[1]):
                t_res.append(ttest_ind(train[samp, gene], train[samp, other_gene]))
    return t_res

I didn't continue because I thought it would take too long.

I would appreciate it if anyone has any ideas. Have a nice day.
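For reference, the per-class ranking described in the question needs one t-test per gene, comparing that class's samples against all other samples (one-vs-rest), rather than gene-vs-gene pairs. A sketch of that, assuming `train` is a samples × genes NumPy array and `classes` is an integer label array as in the question (the random data below is only a stand-in for the real 7070-gene frame):

```python
import numpy as np
from scipy.stats import ttest_ind

# Stand-in data: 40 patients x 100 genes, 5 disease classes.
rng = np.random.default_rng(0)
train = rng.normal(size=(40, 100))
classes = rng.integers(0, 5, size=40)

def top_genes_per_class(train, classes, k):
    """Rank genes per class by |t| of a class-vs-rest t-test."""
    tops = {}
    for cls in np.unique(classes):
        in_cls = classes == cls
        # One vectorized ttest_ind call tests every gene (column) at once.
        t, _ = ttest_ind(train[in_cls], train[~in_cls], axis=0)
        tops[cls] = np.argsort(-np.abs(t))[:k]
    return tops

tops = top_genes_per_class(train, classes, k=10)
```

This replaces the triple Python loop with one vectorized call per class, which is why it stays fast even for thousands of genes.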

I'll try not to get too stats-heavy in my answer, since Stack Overflow is meant to focus on technical issues, but there are pretty large theoretical problems with carrying out multiple tests. In short, a p-value below 0.05 is normally required for significance, meaning the chance of observing data this extreme if the null hypothesis is true is only 5%. If you carry out lots of similar tests, the chance that at least one of them wrongly rejects the null hypothesis becomes much higher.

Think of it as if you were rolling a die trying to get a six: there's only a one-in-six chance on each roll, but if you roll a hundred times, it's more or less guaranteed that many of your rolls will be sixes (even though a six is unlikely on any given throw).

Rather than optimising your code to carry out multiple t-tests, it might be worth looking at alternative tests for significance that are designed to work across multiple comparisons.

Scipy has an ANOVA test you can use for significance across multiple comparisons, like so:

stats.f_oneway(df['sample_one'], df['sample_two'], df['sample_three'])

Although this will just give you the F and p value of the overall test. If you want to break it down in more detail, it's probably worth looking into other tests, such as a Tukey test, which is supported by the statsmodels module. You can find a helpful guide on carrying it out here.
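Recent SciPy (1.8+) also ships a Tukey HSD implementation directly, scipy.stats.tukey_hsd, so you can stay within one library. A minimal sketch with made-up groups:

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# Made-up groups with different means, standing in for per-class samples.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 30)
b = rng.normal(0.5, 1.0, 30)
c = rng.normal(2.0, 1.0, 30)

f_stat, p = f_oneway(a, b, c)   # overall: is any group mean different?
res = tukey_hsd(a, b, c)        # pairwise comparisons, corrected for multiplicity
# res.pvalue is a 3x3 matrix of pairwise p-values
```

The ANOVA answers "is anything different at all?"; the Tukey result then tells you which specific pairs differ, with the multiple-comparison correction built in.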
