How to parse thousands of lines across hundreds of files more efficiently

Question

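I have written a script, but it is too slow, and I am wondering whether anyone can suggest how to speed it up. The part of the script that I think is too slow is this:

I have a list of 1,000 human gene names (each gene name is a number), read into a list called "ListOfHumanGenes"; for example, the start of the list looks like this: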
[2314,2395,10672,8683,5075]

I have 100 files like the following, all with the extension ".HumanHomologs":

HumanGene  OriginalGene  Intercept  age    pval
2314       14248         5.3e-15    0.99   3.5e-33
2395       14297         15.76      -0.05  0.59
10672      14674         7.25       0.19   0.58
8683       108014        21.63      -1.74  0.43
5075       18503         -6.34      1.58   0.19

The algorithm for this part of the script, in plain English rather than code, is:

for each gene in ListOfHumanGenes:
    open each of the 100 files labelled ".HumanHomologs"
    if the gene name is present:
        NumberOfTrials += 1
        if the p-val is < 0.05:
            if the "Age" column < 0:
                UnderexpressedSuccess += 1
            elif "Age" column > 0:
                OverexpressedSuccess += 1
    print each_gene + "\t" + NumberOfTrials + "\t" + UnderexpressedSuccess
    print each_gene + "\t" + NumberOfTrials + "\t" + OverexpressedSuccess

The code for this section is:

for each_item in ListOfHumanGenes:
    OverexpressedSuccess = 0
    UnderexpressedSuccess = 0
    NumberOfTrials = 0
    for each_file in glob.glob("*.HumanHomologs"):
        open_each_file = open(each_file).readlines()[1:]
        for line in open_each_file:
            line = line.strip().split()
            if each_item == line[0]:
                NumberOfTrials +=1    #i.e if the gene is in the file, NumberOfTrials +=1. Not every gene is guaranteed to be in every file
                if line[-1] != "NA":
                    if float(line[-1]) < float(0.05):
                        if float(line[-2]) < float(0):
                            UnderexpressedSuccess +=1
                        elif float(line[-2]) > float(0):
                            OverexpressedSuccess +=1

    underexpr_output_file.write(each_item + "\t" + str(UnderexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(UnderProbability) +"\n") #Note: the "UnderProbability" float is obtained earlier in the script
    overexpr_output_file.write(each_item + "\t" + str(OverexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(OverProbability) +"\n") #Note: the "OverProbability" float is obtained earlier in the script
overexpr_output_file.close()
underexpr_output_file.close()

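This produces two output files (one for overexpression, one for underexpression) that look like the following. The columns are GeneName, #Overexpressed/#Underexpressed and #NumberTrials; the last column can be ignored: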

2314    8   100 0.100381689982
2395    14  90  0.100381689982
10672   10  90  0.100381689982
8683    8   98  0.100381689982
5075    5   88  0.100381689982

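Each ".HumanHomologs" file has more than 8,000 lines, and the gene list is about 20,000 genes long. So I understand why this is slow: for each of the 20,000 genes, it opens 100 files and searches for the gene among the more than 8,000 genes in each file. Can anyone suggest edits that would make this script faster or more efficient?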

Answer 1

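Your algorithm opens all 100 of those files 1,000 times. The optimization that comes to mind immediately is to iterate over the files in the outermost loop, which ensures that each file is opened only once. Then check for the presence of each gene and record whatever else you need.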
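
A minimal sketch of that loop inversion, assuming the gene names are compared as strings and that each file has one header line followed by whitespace-separated columns ending in age and pval, as shown in the question. The function name tally_genes and the [trials, under, over] layout are illustrative choices, not part of the original script.

```python
import glob


def tally_genes(paths, genes, alpha=0.05):
    """Count trials and significant under/over-expression per gene.

    Each file is opened exactly once, and the gene lookup is a dict
    membership test rather than a scan over the whole gene list.
    """
    # counts[gene] = [NumberOfTrials, UnderexpressedSuccess, OverexpressedSuccess]
    counts = {str(gene): [0, 0, 0] for gene in genes}
    for path in paths:
        with open(path) as handle:
            next(handle, None)  # skip the header line
            for line in handle:
                fields = line.split()
                if not fields or fields[0] not in counts:
                    continue
                entry = counts[fields[0]]
                entry[0] += 1  # the gene is present in this file
                if fields[-1] == "NA":
                    continue
                if float(fields[-1]) < alpha:
                    age = float(fields[-2])
                    if age < 0:
                        entry[1] += 1
                    elif age > 0:
                        entry[2] += 1
    return counts


if __name__ == "__main__":
    result = tally_genes(glob.glob("*.HumanHomologs"), [2314, 2395, 10672])
    for gene, (trials, under, over) in result.items():
        print(gene, trials, under, over)
```

Iterating over the file handle instead of calling readlines() also avoids holding a whole file in memory, although with files of this size the main gain comes from opening each file once.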

Also, the pandas module would be very handy for handling this kind of csv file. Take a look at pandas.
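
As a rough sketch of what a pandas version might look like: the function name summarise_with_pandas is hypothetical, and the code assumes whitespace-delimited files with the header shown in the question, gene IDs that parse as integers, and "NA" marking missing p-values (pandas reads "NA" as missing by default).

```python
import pandas as pd


def summarise_with_pandas(paths, genes, alpha=0.05):
    """Return one row per gene with trials, under and over counts.

    paths must be non-empty; genes is an iterable of integer gene IDs.
    """
    frames = [
        pd.read_csv(path, sep=r"\s+", usecols=["HumanGene", "age", "pval"])
        for path in paths
    ]
    data = pd.concat(frames, ignore_index=True)
    data = data[data["HumanGene"].isin(genes)]

    # Missing p-values are NaN, and NaN < alpha evaluates to False.
    significant = data["pval"] < alpha
    data = data.assign(
        trials=1,
        under=(significant & (data["age"] < 0)).astype(int),
        over=(significant & (data["age"] > 0)).astype(int),
    )
    return data.groupby("HumanGene")[["trials", "under", "over"]].sum()
```

Unlike a dictionary initialised from the gene list, this result only contains genes that occur in at least one file; genes that never appear would have to be added back with zero counts if they are needed in the output.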

Answer 2

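Thanks for your help; the insight about swapping the loops was invaluable. One note: instead of the ListOfHumanGenes described above, I now have a DictOfHumanGenes, where each key is a human gene and the value holds (1) NumberOfTrials, (2) UnderexpressedSuccess and (3) OverexpressedSuccess; this also sped up other parts of my code. The improved, more efficient script is as follows: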

for each_file in glob.glob("*.HumanHomologs"):
    open_each_file = open(each_file).readlines()[1:]
    for line in open_each_file:
        line = line.strip().split()
        if line[0] in DictOfHumanGenes: 
            DictOfHumanGenes[line[0]][0] +=1  #This is the Number of trials
            if line[-1] != "NA":
                if float(line[-1]) < float(0.05):
                    if float(line[-2]) < float(0):
                        DictOfHumanGenes[line[0]][1] +=1  #This is the UnderexpressedSuccess
                    elif float(line[-2]) > float(0):
                        DictOfHumanGenes[line[0]][2] +=1  #This is the OverexpressedSuccess
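
The loop above presupposes that DictOfHumanGenes already exists, with a three-element list of counters for every gene. One way to build it and to write the counts out afterwards might look like the sketch below; the helper names make_gene_dict and write_counts are assumptions for illustration and do not come from the original script.

```python
def make_gene_dict(genes):
    """Map each gene name (as a string) to [trials, under, over], all zero."""
    # A fresh list per gene; sharing one list would merge all the counters.
    return {str(gene): [0, 0, 0] for gene in genes}


def write_counts(gene_dict, path, column, probability):
    """Write gene, success count, trial count and probability, tab-separated.

    column is 1 for UnderexpressedSuccess or 2 for OverexpressedSuccess.
    """
    with open(path, "w") as out:
        for gene, counts in gene_dict.items():
            out.write(f"{gene}\t{counts[column]}\t{counts[0]}\t{probability}\n")
```

Note that dict.fromkeys(genes, [0, 0, 0]) would not work here, because every key would then refer to the same list and each increment would show up under every gene.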

I am now looking into pandas to see how to incorporate it; if I can make the code more efficient with pandas, I will post an answer here.

Solution 1
1 2017-01-11 10:58:57

Solution 2
0 2017-01-11 11:59:07