How can I improve my looping Python script, involving different mathematical operations for different conditions on each loop?

I am posting again because I had no luck making the following script more efficient. For more details, see my previous post; the basic situation is as follows.

I have written a script that computes a score and a frequency for each genetic profile in a list.

A genetic profile here is a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs looks like the table below, which lists all possible genotype combinations across the three SNPs. This table was generated with itertools.product in another script:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA
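
A table like the one above could be generated roughly as follows. This is a minimal sketch, not the script that was actually used: the `genotypes` list and its ordering are assumptions chosen so that the rows come out in the same order as shown.

```python
from itertools import product

# Hypothetical per-SNP genotype lists, ordered as in the table above.
genotypes = [
    ["AA", "AT", "TT"],  # SNP1
    ["CC", "CG", "GG"],  # SNP2
    ["TT", "TA", "AA"],  # SNP3
]

# product() varies the last iterable fastest, while the table varies the
# first column fastest, so iterate over the reversed list and flip each
# resulting tuple back into SNP1, SNP2, SNP3 order.
profiles = [combo[::-1] for combo in product(*reversed(genotypes))]

for profile in profiles:
    print("   ".join(profile))
```

With n SNPs this produces 3**n rows, so the table grows very quickly as SNPs are added.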

I then have another file containing a table of weights and frequencies for the three SNPs, such as:

SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136  

The columns are the SNP ID, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.

The main script takes these two files and computes, for each genetic profile, a score and a frequency. The score is the sum of the log odds ratios over every risk allele in every SNP. The frequency is the product of the genotype frequencies derived from the allele frequencies, assuming Hardy-Weinberg equilibrium.
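
To make the arithmetic concrete before the full script, here is a minimal sketch of the per-SNP calculation. The function name `genotype_terms` and the example profile are made up for illustration. Note one assumption that differs from the script: risk alleles are counted regardless of their position in the genotype string, whereas the script relies on the allele ordering.

```python
def genotype_terms(genotype, risk_allele, log_or, p):
    """Return (score increment, frequency factor) for one SNP genotype.

    Assumption: risk alleles are counted wherever they appear in the
    two-character genotype string.
    """
    n_risk = genotype.count(risk_allele)  # 0, 1 or 2 copies
    # Hardy-Weinberg genotype proportions for 0, 1 and 2 risk alleles
    hwe = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
    return n_risk * log_or, hwe[n_risk]


# Hypothetical profile: (genotype, risk allele, log(OR), frequency),
# with the weights and frequencies taken from the table above.
profile = [
    ("AA", "A", 0.223143551314, 0.97273),
    ("CG", "C", 0.0676586484738, 0.3),
    ("TT", "T", 0.0769610411361, 0.1136),
]

score, freq = 0.0, 1.0
for genotype, risk_allele, log_or, p in profile:
    ds, df = genotype_terms(genotype, risk_allele, log_or, p)
    score += ds   # log odds ratios add up
    freq *= df    # genotype frequencies multiply

print(score, freq)
```

For this profile the score comes to roughly 0.6679 and the frequency to roughly 0.00513.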

import sys

snp={}
riskall={}
weights={}
freqs={}    # effect allele, *MAY NOT BE MINOR ALLELE

pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)

# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        snp[pos]=(line.split()[0])
        riskall[line.split()[0]]=line.split()[1]
        weights[line.split()[0]]=line.split()[4]
        freqs[line.split()[0]]=line.split()[pop]

        pos+=1



### compute scores for each combination
with open(sys.argv[2], 'r') as f:
    for line in f:
        score=0
        freq=1
        for j in range(len(line.split())):
            rsid=snp[j]
            riskallele=riskall[rsid]
            frequency=freqs[rsid]
            wei=weights[rsid]
            allele1=line.split()[j][0]
            allele2=line.split()[j][1]
            if allele2 != riskallele:      # homozygous for ref
                score+=0
                freq*=(1-float(frequency))*(1-float(frequency))
            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is risk allele!
                score+=float(wei)
                freq*=2*(1-float(frequency))*(float(frequency))
            elif allele1 == riskallele: # and allele2 == riskall[snp[j]]:      # homozygous for risk, be sure to limit risk to second allele!
                score+=2*float(wei)
                freq*=float(frequency)*float(frequency)

            if freq < float(sys.argv[3]):   # threshold to stop loop in interest of efficiency 
                break

        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))

I have set a variable that lets me specify a threshold at which the loop breaks once the frequency gets extremely low. What improvements can be made to speed up the script?

I have tried using Pandas, which turned out much slower, and I am not sure whether vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, which gave a slight improvement.

The expected output from the above would look like this:

GG,AA,GG        0       0.000286302968304
GG,AA,GA        0.0769610411361 7.33845153414e-05
GG,AA,AA        0.153922082272  4.70243735491e-06
GG,AG,GG        0.0676586484738 0.00024540254426
GG,AG,GA        0.14461968961   6.29010131498e-05
GG,AG,AA        0.221580730746  4.03066058992e-06
GG,GG,GG        0.135317296948  5.25862594844e-05
GG,GG,GA        0.212278338084  1.34787885321e-05
GG,GG,AA        0.28923937922   8.63712983555e-07
GA,AA,GG        0.223143551314  0.0204250448374
GA,AA,GA        0.30010459245   0.00523530030129
GA,AA,AA        0.377065633586  0.000335475019306
GA,AG,GG        0.290802199788  0.0175071812892
GA,AG,GA        0.367763240924  0.00448740025824
GA,AG,AA        0.44472428206   0.000287550016548
GA,GG,GG        0.358460848262  0.00375153884769
GA,GG,GA        0.435421889398  0.000961585769624
GA,GG,AA        0.512382930534  6.16178606889e-05
AA,AA,GG        0.446287102628  0.364284082594
AA,AA,GA        0.523248143764  0.0933724543834
AA,AA,AA        0.6002091849    0.00598325294334
AA,AG,GG        0.513945751102  0.312243499367
AA,AG,GA        0.590906792238  0.0800335323286
AA,AG,AA        0.667867833374  0.00512850252286
AA,GG,GG        0.581604399576  0.0669093212928
AA,GG,GA        0.658565440712  0.0171500426418
AA,GG,AA        0.735526481848  0.00109896482633

EDIT: Added previous post link, along with expected output.

Disclaimer: I did not test this; it is closer to pseudocode.

Here are some general ideas about what is slow and what is fast in programming, and in Python in particular:

You should try to move out of a loop everything that does not change inside that loop. Also, in Python, you should try to replace loops with comprehensions: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python

[ expression for item in list if conditional ]
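
As a small illustration of both points, the sketch below builds the same list with a plain loop and with a comprehension. The `rows` list stands in for the lines of the weights file, and the 0.2 frequency cut-off is an arbitrary value chosen only to give the conditional something to do.

```python
# Hypothetical rows standing in for the lines of the weights file.
rows = [
    "SNP1 A T 1.25 0.223143551314 0.97273",
    "SNP2 C G 1.07 0.0676586484738 0.3",
    "SNP3 T A 1.08 0.0769610411361 0.1136",
]

# Loop version: split() is called twice per kept line.
weights_loop = []
for row in rows:
    if float(row.split()[5]) > 0.2:
        weights_loop.append(float(row.split()[4]))

# Comprehension version: each line is split once by map(), then the
# fields are filtered and converted in a single expression.
weights_comp = [float(fields[4])
                for fields in map(str.split, rows)
                if float(fields[5]) > 0.2]

print(weights_comp)
```

Whether the comprehension is measurably faster depends on the data; the larger gain usually comes from not repeating the split() and float() calls.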

You should also try to use the map/filter functions where possible, and you can prepare your data so that the program is more efficient. For example, this lookup

    rsid=snp[j]
    riskallele=riskall[rsid]

is basically a double mapping, and it can probably be done better if you build your snp structure like this (you can use index -1 for the last column and get rid of pop):

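with open(sys.argv[1]) as f:
    # column 4 is log(OR), i.e. the weight; the last column is the frequency
    snp = [{"riskall": line[1], "weight": float(line[4]), "freq": float(line[-1])}
           for line in map(str.split, f)]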

Your computing loop can then become something like this:

### compute scores for each combination
stop = float(sys.argv[3])  # convert the threshold once, outside the loops
with open(sys.argv[2], 'r') as f:
    for fline in f:
        score = 0.0  # work with floats from the start
        freq = 1.0
        line = fline.split()  # split only once per line

        for j, field in enumerate(line):
            s = snp[j]
            riskallele = s["riskall"]
            frequency = s["freq"]
            wei = s["weight"]
            allele1, allele2 = field  # the two characters of the genotype

            if allele2 != riskallele:      # homozygous for ref
                freq *= (1 - frequency) * (1 - frequency)
            elif allele1 != riskallele:    # heterozygous, A2 is the risk allele
                score += wei
                freq *= 2 * (1 - frequency) * frequency
            else:                          # homozygous for risk
                score += 2 * wei
                freq *= frequency * frequency

            if freq < stop:   # threshold to stop loop in interest of efficiency
                break

        print(','.join(line) + "\t" + str(score) + "\t" + str(freq))

The ultimate goal I would aim for is to convert this into some map/reduce form:

Each genotype is one of [A,C,G,T] x [A,C,G,T] = 16 combinations, and it is tested against a risk allele from [A,C,G,T], which gives only 64 combinations in total. So I can create a map of the form [AC,C] -> score, freq_function and get rid of the whole if block.

Sometimes the best approach is to split the code into small functions, reorganize them, and then merge them back.
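
One way this lookup idea might look in practice is sketched below. Because the risk allele is fixed per SNP, the sketch precomputes one 16-entry table per SNP rather than a single 64-entry map. The helper names `build_lookup` and `score_profile` are hypothetical, and the sketch counts risk alleles in either position, which is an assumption that differs from the ordering-dependent if block.

```python
def build_lookup(snps):
    """For each SNP, map genotype string -> (score increment, frequency factor)."""
    lookup = []
    for s in snps:
        risk, w, p = s["riskall"], s["weight"], s["freq"]
        # Hardy-Weinberg proportions for 0, 1 and 2 risk alleles
        hwe = ((1 - p) ** 2, 2 * p * (1 - p), p * p)
        table = {}
        for a1 in "ACGT":
            for a2 in "ACGT":
                n = (a1 == risk) + (a2 == risk)  # number of risk alleles
                table[a1 + a2] = (n * w, hwe[n])
        lookup.append(table)
    return lookup


def score_profile(genotypes, lookup, stop=0.0):
    """Accumulate score and frequency with one dict lookup per SNP."""
    score, freq = 0.0, 1.0
    for table, genotype in zip(lookup, genotypes):
        ds, df = table[genotype]
        score += ds
        freq *= df
        if freq < stop:  # same early exit as the original loop
            break
    return score, freq


# Example data using the weights table from the question.
snps = [
    {"riskall": "A", "weight": 0.223143551314, "freq": 0.97273},
    {"riskall": "C", "weight": 0.0676586484738, "freq": 0.3},
    {"riskall": "T", "weight": 0.0769610411361, "freq": 0.1136},
]
lookup = build_lookup(snps)
print(score_profile(["AA", "CG", "TT"], lookup))
```

The tables are built once, so the per-line work is reduced to a split, a few dictionary lookups, and the additions and multiplications.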
