Using the hypergeometric test in Python

Question

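I have two gene lists and I compute the intersection between them.
I need a p-value for the hypothesis that this intersection occurred by chance.
I tried to do this with Fisher's exact test (the scipy function).
Note that I need a one-sided p-value.

My code: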

import pandas as pd
from scipy import stats

def main(gene_path1, gene_path2, pop_size):
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)

    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])

    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()

    oddsratio, pvalue = stats.fisher_exact([[len_genes1 - len_intersection, len_genes1], [len_genes2 - len_intersection, len_genes2]], alternative='less')

    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, pvalue: {pvalue}')

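For simplicity, I used lists of numbers rather than genes.

The files are too long to copy in full, but imagine two files containing many random numbers separated by newlines.
For example: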

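The question is: how can I be sure that I passed the right arguments to the Fisher function, that is, arguments that match the hypothesis I want to test?
I suspect I did it wrong, but I am not sure why. One hint that something is off: I know the population size should matter, but I do not know where or how to use it.
Any clues or insights would be appreciated.

Update:
I tried to implement it in a different way.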

from scipy.stats import hypergeom as hg
import pandas as pd
def main(gene_path1, gene_path2, pop_size):
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)

    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])

    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()
    pvalue = hg.cdf(int(len_intersection)-1, int(pop_size), int(len_genes1), int(len_genes2))
    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, p value: {pvalue}')

I just want to know whether I passed the arguments in the right positions, and how I can verify that.

Answer 1

This should also help: http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/practicals/go_statistics_td/go_statistics_td_2015.html

g = 75 ## Number of submitted genes
k = 59 ## Size of the selection, i.e. submitted genes with at least one annotation in GO biological processes
m = 611 ## Number of "marked" elements, i.e. genes associated to this biological process
N = 13588 ## Total number of genes with some annotation in GOTERM_BP_FAT.
n = N - m ## Number of "non-marked" elements, i.e. genes not associated to this biological process
x = 19 ## Number of "marked" elements in the selection, i.e. genes of the group of interest that are associated to this biological process

# Python
from scipy import stats
stats.hypergeom(M=N,
                n=m,
                N=k).sf(x-1)
# 4.989682834451419e-12

# R
phyper(q=x -1, m=m, n=n, k=k, lower.tail=FALSE)
# [1] 4.989683e-12
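
One way to check the argument order independently of scipy is to sum the hypergeometric probabilities by hand. The following is a minimal sketch using only the standard library; the function name `hypergeom_upper_tail` is made up for illustration, and the argument names follow the R-style notation above (N total, m marked, k drawn, x observed).

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(x, N, m, k):
    """P(X >= x) when k items are drawn without replacement
    from a population of N items, m of which are marked."""
    total = comb(N, k)  # number of ways to draw k items from N
    # Count draws containing exactly i marked items, for i = x .. min(m, k).
    favourable = sum(
        comb(m, i) * comb(N - m, k - i)
        for i in range(x, min(m, k) + 1)
    )
    # Exact rational arithmetic first, converted to float only at the end.
    return float(Fraction(favourable, total))


p = hypergeom_upper_tail(x=19, N=13588, m=611, k=59)
print(p)
```

If the arguments are mapped correctly, the printed value should agree with the scipy and R results quoted above up to floating-point error.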

Answer 2

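I wonder whether you still have this problem. In any case, I found this link very useful for checking the results of a hypergeometric test. As for your calculation, your result should equal the cumulative probability P(X < int(len_intersection)).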
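
To make the two tails concrete, here is a small sketch using the variable names from the question. The numbers are hypothetical and chosen only for illustration. It assumes the population is the full set of genes from which both lists were drawn, and that the 2x2 table is built as in-both / only-in-list-1 / only-in-list-2 / in-neither. Under those assumptions, `hypergeom.cdf(x - 1, ...)` gives the lower tail P(X < x), `hypergeom.sf(x - 1, ...)` gives the upper tail P(X >= x), which is usually the quantity of interest for an overlap larger than expected by chance, and the upper tail should match a one-sided Fisher exact test with `alternative='greater'`.

```python
from math import isclose

from scipy import stats

# Hypothetical numbers, for illustration only.
pop_size = 1000          # total number of genes in the background
len_genes1 = 100         # size of the first list
len_genes2 = 80          # size of the second list
len_intersection = 15    # genes present in both lists

# Lower tail: P(X < len_intersection), as computed in the question's update.
lower_tail = stats.hypergeom.cdf(len_intersection - 1, pop_size, len_genes1, len_genes2)
# Upper tail: P(X >= len_intersection), i.e. an overlap at least this large.
upper_tail = stats.hypergeom.sf(len_intersection - 1, pop_size, len_genes1, len_genes2)

# 2x2 contingency table over the whole population (assumed layout):
#                  in list 2          not in list 2
# in list 1        both               only list 1
# not in list 1    only list 2        neither
table = [
    [len_intersection, len_genes1 - len_intersection],
    [len_genes2 - len_intersection,
     pop_size - len_genes1 - len_genes2 + len_intersection],
]
_, p_fisher = stats.fisher_exact(table, alternative='greater')

print(f'lower tail: {lower_tail}, upper tail: {upper_tail}, fisher: {p_fisher}')
print('tails sum to 1:', isclose(lower_tail + upper_tail, 1.0))
print('fisher matches upper tail:', isclose(p_fisher, upper_tail, rel_tol=1e-6))
```

Note that the table here uses the population size in the "neither" cell, which is where `pop_size` enters a Fisher exact test.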

Answer 1
4 2021-02-22 23:21:42

Answer 2
0 2021-01-11 14:10:32