Iterating Over Two Large Lists using Python

Question

I have two files both of which are tab delimited. One of the file is almost 800k lines and it is a An Exonic Coordinates file and the other file is almost 200k lines (It is a VCF File).

I am writing a code in python to find and filter the position in the VCF that is within an exonic coordinates (Exon Start and End from Exonic Coordinates File) and writes it to a file.

However, because the files are big, it took a couple of days to get the filtrated output file?

So the code below is partially solve the issue of speed but the problem is to figure out is to speed the filtration process which is why I used a break to exit the second loop and I want to start from the beginning of the outer loop instead taking the next element from the first loop (outer loop)?

Here is my code:

import
import sys
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
                 list_coord.append((row[0],row[1],row[2]))

    def parseVcf(vcf,src):
        done = False
        with open(vcf,'r') as f:
                    reader=csv.reader((f),delimiter='\t')
                    vcf_out_split = vcf.split('.')
                    vcf_out_split.insert(2,"output_CORRECT2")
                    outpt = open('.'.join(vcf_out_split),'a')
                    for coord in list_coord:
                            for row in reader:
                                   if '#' not in row[0]:
                                            coor_genom = int(row[1])
                                            coor_exon1 = int(coord[1])+1
                                            coor_exon2 = int(coord[2])
                                            coor_genom_chr = row[0]
                                            coor_exon_chr = coord[0]
                                            ComH = row[7].split(';')
                                            for x in ComH:
                                               if 'DP4=' in x:
                                                 DP4_split=x[4:].split(',')
                                                 if (coor_exon1 <= coor_genom <= coor_exon2):
                                                    if (coor_genom_chr == coor_exon_chr):
                                                       if ((int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1)):
                                                         done = True

                                                         outpt.write('\t'.join(row) + '\n')

                                            if done:
                                                    break
                    outpt.close()
for root,dirs,files in os.walk("."):
    for file in files:
      pathname=os.path.join(root,file)
      if file.find("1_1")==0:
        print "Parsing " + file
        parseVcf(pathname, "1_1")

ref_ordered.txt:

1   69090   70008
1   367658  368597
1   621095  622034
1   861321  861393
1   865534  865716
1   866418  866469
1   871151  871276
1   874419  874509

1_1 Input File:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT     directory
1   14907   rs79585140  A   G   20  .   DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ    0/1:50,0,51:50
1   14930   rs75454623  A   G   44  .   DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38  GT:PL:GQ    0/1:74,0,58:61
1   15211   rs78601809  T   G   9.33    .   DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171    GT:PL:GQ    1/1:41,10,0:13
1   16146   .   A   C   25  .   DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ    0/1:55,0,68:58
1   16257   rs78588380  G   C   40  .   DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308   GT:PL:GQ    0/1:70,0,249:73
1   16378   rs148220436 T   C   39  .   DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227    GT:PL:GQ    0/1:69,0,90:72

OUTPUT File:

1   877831  rs6672356   T   C   44.8    .   DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ    1/1:76,6,0:10
1   878000  .   C   T   44.8    .   DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA    GT:PL:GQ    1/1:76,6,0:10
1   881627  rs2272757   G   A   205 .   DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1   GT:PL:GQ    1/1:238,27,0:51

Answer 1

First of all, I did not include any code because it looks like homework to me (I have had homework like this). I will however try to explain the steps I took to improve my scripts, even though I know my solutions are far from perfect.

your script could be slow because for every line in your csv file you open, write and close your output file. Try to make a list of lines you want to add to the output file, and after you are done with reading and filtering, then start writing.

You also might want to consider to write functions per filter and call these functions with the line as variable. That way you can easily add filters later on. I use a counter to keep track of the amount of succeeded filters and if in the end counter == len(amountOfUsedFilers) I add my line to the list.

Also, why do you use outpt = open('.'.join(vcf_out_split),'a') and with open(vcf,'r') as f: try to be consistent and smart in your choices.

Bioinformatics for the win!

Answer 2

If both of your files are ordered, you can save a lot of time by iterating over them in parallel, always advancing the one with lowest coordinates. This way you will only handle each line once, not many times.

Here's a basic version of your code that only does the coordinate checking (I don't fully understand your DP4 condition, so I'll leave it to you to add that part back in):

with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn) as out_f:
    coords = csv.reader(coords_f, delimiter="\t")
    vcf = csv.reader(vcf_f, delimiter="\t")
    out = csv.writer(out_f, delimiter="\t")

    next(vcf) # discard header row, or use out.writeline(next(vcf)) to preserve it!

    try:
        c = next(coords)
        r = next(vcf)

        while True:
            if int(c[1]) >= int(r[1]):   # vcf file is behind
                r = next(vcf)
            elif int(c[2]) < int(r[1]):  # coords file is behind
                c = next(coords)
            else:                        # int(c[1]) < int(r[1]) <= int(c[2])
                out.writeline(r)  # add DP4 check here, and indent this line under it

                r = next(vcf)     # don't indent this line
    except StopIteration: # one of the files has ended
        pass

Iterating Over Two Large Lists using Python

Question

2 answers

solution1
1 2015-03-24 07:53:50

solution2
0 2015-03-24 08:27:54

Iterating Over Two Large Lists using Python

Question

2 answers

solution1 1 2015-03-24 07:53:50

solution2 0 2015-03-24 08:27:54

solution1
1 2015-03-24 07:53:50

solution2
0 2015-03-24 08:27:54