Sequence comparison function not working as expected

I'm an amateur Python coder trying to learn, so I wanted to ask what's going on with my script. I can't work out where it's going wrong (I think around line 24, i.e. difference = "%s-%s [%d]" %( seq1[i], seq2[i], i) ).

The script takes a set of sequences, renames them (this part works), and then compares each sequence to a reference sequence (here just the first sequence in the file). If a letter in a sequence does not match the reference, it should print the difference and its location. However, as you can see below, this is not working.

Here is a mock up input file - http://pastebin.com/AH2zxdBn

import re
from Bio.Alphabet import generic_dna, generic_protein
from Bio import SeqIO

def compare_seqs( seq1, seq2 ):

  similar = 0
  diff    = 0

  diff_positions = []

  for i in range(0, len( seq1 )):
    if ( seq1[ i ] != seq2[ i ]):
      difference = "%s-%s [%d]" %( seq1[i], seq2[i], i)
      diff_positions.append( difference )
#    else:
#       similar += 1


  return ",".join( diff_positions )


new_seq = []

reference_sequence = ""
reference_name     = ""

outfile = open("test_out.csv", 'w')

for record in SeqIO.parse(open('test.fa', 'ru'), 'fasta', generic_protein):
    record_id = re.sub(r'\d+_(\d+_\d\#\d+)_\d+', r'\1', record.id)


    if ( not reference_sequence ):
      reference_sequence = record.seq
      reference_name     = record_id
      #continue
    print "\t".join([reference_name, record_id, compare_seqs(reference_sequence, record.seq)])

Here is the output I am getting, which is incorrect: position 454 in 7065_8#4 is actually P.

7065_8#1    7065_8#1    
7065_8#1    7065_8#2    
7065_8#1    7065_8#3    
7065_8#1    7065_8#4    E-G [245]
7065_8#1    7065_8#5

The best way to troubleshoot this is to break it down into smaller pieces and verify each one.

Here's a minimal difference implementation:

def compare_sequences(seq1, seq2):
    for index, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            yield index, a, b

Here it is working:

print list(compare_sequences('abcdef', 'abddef'))

Which gives me

[(2, 'c', 'd')]

You can use this as a simple proof that it's working. What I'd recommend is isolating the input to your function and verifying that the function works as expected on it.
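
As a rough sketch of what isolating the function could look like: feed it short, hand-written strings whose differences are known in advance and assert on the result. The test strings below are made up for illustration and are not taken from the input file. Note that the reported indexes are 0-based, and that zip() stops at the shorter input, so a difference in length is not reported.

```python
def compare_sequences(seq1, seq2):
    """Yield (index, a, b) for every position where the two sequences differ."""
    for index, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            yield index, a, b


# Hand-written inputs (made up for illustration) with known differences.
assert list(compare_sequences('MKTEP', 'MKTEP')) == []
assert list(compare_sequences('MKTEP', 'MKTGP')) == [(3, 'E', 'G')]
assert list(compare_sequences('MKTEP', 'AKTEA')) == [(0, 'M', 'A'), (4, 'P', 'A')]

# zip() stops at the shorter input, so missing trailing residues go unreported.
assert list(compare_sequences('MKTEP', 'MKT')) == []
```

If these assertions pass but the real data still gives surprising positions, the problem is more likely in the input than in the comparison itself.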

Maybe the input contains whitespace or a newline where you do not expect it, which is throwing everything off?
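
One way to check for that, as a minimal sketch: before comparing, look at the lengths of the two sequences and at any characters outside the alphabet you expect. The helper names find_unexpected_chars and check_pair are hypothetical, and the allowed alphabet (uppercase letters plus '-' for gaps and '*' for stops) is an assumption you would adjust to your data.

```python
import string

# Assumed alphabet: uppercase letters plus gap and stop symbols. Adjust as needed.
ALLOWED = string.ascii_uppercase + '-*'


def find_unexpected_chars(seq, allowed=ALLOWED):
    """Return (index, repr(char)) for every character not in `allowed`."""
    return [(i, repr(ch)) for i, ch in enumerate(str(seq)) if ch not in allowed]


def check_pair(reference, other):
    """Collect readable warnings about two sequences before comparing them."""
    problems = []
    if len(reference) != len(other):
        problems.append("length mismatch: %d vs %d" % (len(reference), len(other)))
    for name, seq in (("reference", reference), ("other", other)):
        for index, shown in find_unexpected_chars(seq):
            problems.append(
                "%s has unexpected character %s at index %d" % (name, shown, index)
            )
    return problems
```

Calling check_pair(reference_sequence, record.seq) inside the loop and printing any non-empty result would show whether a stray newline or a length difference is shifting the reported positions.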
