Comparing part of a line with every other line in another file in Python

Question

I'm trying to compare each line in one file against the lines of another file and write every match to an output file. For example, here is the first file:

chr8    18      .       T       T       *       *
chr8    29      .       C       T       .       .
chr9    21      .       TA      T       .       .
chr18    22      .       C       T       .       .
chr18    23      .       A       G       .       .

Then here's the other file:

chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    ensembl CDS     27     1227    .       +       0       gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";
chr8    ensembl CDS     41      1006    .       -       0       gene_id "ENSCAFG00000029302"; gene_version "1"; transcript_id "ENSCAFT00000048043"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000036901"; protein_version "1";

And the output I want is:

chr8    18      .       T       T       *       *
chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    29      .       C       T       .       .   
chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    ensembl CDS     27     1227    .       +       0       gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";

So I want to take every line of the first file and find every line in the second file where the first column matches and the number in column 2 of file 1 falls within the range given by columns 4 and 5 of file 2. For each match, I want to write the line from the first file to a new file, with every matching line from file 2 under it. Here's what I tried:

opt=''
with open('file1.vcf') as vfh:
    with open('file2.gtf') as gfh:
        for line in vfh:
                ct=0
                vll=line.split('\t')
                for gline in gfh:
                    gll=gline.split('\t')
                    if vll[0] == gll[0]:
                        if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):
                            while ct < 1:
                                opt+=line
                                ct+=1
                            opt+=gline
with open('out.txt','w') as fh:
    fh.write(opt)

But I never get the output I want.

Answer 1

I believe your indexes are wrong.

if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):

"vll[1]" is 18 and "gll[3]" is 1042, because "ensembl CDS" seems to be separated by a space, not by "\t". Please step through the code with a debugger and verify the indexes.
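One way to check which delimiter the file really uses, without a debugger, is to split a sample line both ways and compare the fields. This is only a minimal sketch: the `sample` string is a shortened, hypothetical GTF line that assumes tab separators, not a line read from your actual file.

```python
# Hypothetical, shortened GTF line; tabs are assumed between columns.
sample = 'chr8\tensembl\tCDS\t1\t1042\t.\t-\t0\tgene_id "ENSCAFG00000031632";'

tab_fields = sample.split('\t')  # split on tab characters only
ws_fields = sample.split()       # split on any run of whitespace

# repr() shows tabs as \t, so the real delimiter is easy to spot.
print(repr(sample))
print(tab_fields[3], tab_fields[4])
print(ws_fields[3], ws_fields[4])
```

If the two splits disagree at indexes 3 and 4 for a real line from your file, the columns are not all tab-separated, and the indexes used in the comparison need to change accordingly.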

Answer 2

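Found the issue: I just needed to move my with open statement for the second file inside the loop. A file object can only be iterated once, so in the original code the inner loop had nothing left to read after the first line of the first file. I also added a check to skip comment lines in the original file: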

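opt=''
with open('a1.vcf') as vfh:
    for line in vfh:
        # Skip VCF header/comment lines, which start with '#'.
        if not line.startswith('#'):
            ct=0
            vll=line.split('\t')
            # Reopen the GTF for each VCF line so the inner loop
            # always starts from the top of the file.
            with open('cds.gtf') as gfh:
                for gline in gfh:
                    gll=gline.split('\t')
                    if vll[0] == gll[0]:
                        if int(gll[3]) < int(vll[1]) < int(gll[4]):
                            # Add the VCF line once, before its first match.
                            if ct < 1:
                                opt+=line
                                ct+=1
                            opt+=gline
with open('out.txt','w') as fh:
    fh.write(opt)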
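
Reopening the GTF for every VCF line works, but it rereads the whole file each time. A possible alternative, sketched here under the assumptions that both files are tab-separated and that the GTF fits in memory, reads the GTF once into a dictionary keyed by chromosome. The names `load_gtf`, `annotate`, and `main` are made up for illustration, and the comparison keeps the strict bounds used in the code above.

```python
from collections import defaultdict


def load_gtf(lines):
    """Group GTF lines by chromosome as (start, end, raw_line) tuples."""
    by_chrom = defaultdict(list)
    for gline in lines:
        if gline.startswith('#') or not gline.strip():
            continue  # skip comments and blank lines
        gll = gline.split('\t')
        by_chrom[gll[0]].append((int(gll[3]), int(gll[4]), gline))
    return by_chrom


def annotate(vcf_lines, by_chrom):
    """Return each matched VCF line followed by the GTF lines containing it."""
    out = []
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip VCF headers and blank lines
        vll = line.split('\t')
        pos = int(vll[1])
        # Only look at GTF records on the same chromosome.
        hits = [g for start, end, g in by_chrom.get(vll[0], [])
                if start < pos < end]
        if hits:
            out.append(line)
            out.extend(hits)
    return out


def main(vcf_path, gtf_path, out_path):
    """Read both files, then write the combined output."""
    with open(gtf_path) as gfh:
        by_chrom = load_gtf(gfh)
    with open(vcf_path) as vfh, open(out_path, 'w') as fh:
        fh.writelines(annotate(vfh, by_chrom))
```

Calling `main('a1.vcf', 'cds.gtf', 'out.txt')` would then produce the same kind of output as the nested loops, while reading each file only once.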
