
Find the Hamming distance between two DNA strings

I'm just learning Python 3. The task is to ask the user for two strings and find the Hamming distance between them. The input sequences should only include the nucleotides 'A', 'T', 'G' and 'C'. The program should ask the user to re-enter the sequence if an invalid character is entered. The program should also check that the strings are the same length; if they are not, it should ask the user to enter the strings again. The user should be able to enter upper case, lower case, or a mix of both.

The program should print the output in the following format:

please enter string one: GATTACA
please enter string two: GACTATA
GATTACA
|| || |  
GACTATA
The hamming distance of sequence GATTACA and GACTATA is 2
So the Hamming distance is 2.

Here is what I have tried so far, but it does not work:

def hamming_distance(string1, string2):
    string1 = input("please enter first sequence")
    string2 = input("please enter second sequence")
    distance = 0
     L = len(string1)
    for i in range(L):
        if string1[i] != string2[i]:
            distance += 1
    return distance

I get an indentation error on this line: L = len(string1)

def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Strand lengths are not equal!")
    return sum(ch1 != ch2 for ch1,ch2 in zip(s1,s2))

Alternatively, you could use the function below. I also added a check that raises an exception, because the Hamming distance is only defined for sequences of equal length, so an attempt to calculate it between sequences of different lengths should fail.

def distance(str1, str2):
    if len(str1) != len(str2):
        raise ValueError("Strand lengths are not equal!")
    else:
        return sum(1 for (a, b) in zip(str1, str2) if a != b)
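
Neither function above covers the input side of the question: re-prompting on invalid characters or unequal lengths, accepting mixed case, and printing the match line. A minimal sketch of that part might look like the following. The helper names is_valid_dna, match_line, read_sequences and format_report are made up for illustration and do not come from the question. The sketch also assumes that an empty sequence counts as invalid and that both sequences are requested again after any failure, neither of which the question spells out.

```python
VALID_NUCLEOTIDES = set("ATGC")


def is_valid_dna(seq):
    """Return True if seq is non-empty and contains only A, T, G, C.

    Treating the empty string as invalid is an assumption of this sketch.
    """
    return len(seq) > 0 and set(seq) <= VALID_NUCLEOTIDES


def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Strand lengths are not equal!")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def match_line(s1, s2):
    """Build the middle line: '|' where characters match, ' ' otherwise."""
    return "".join("|" if a == b else " " for a, b in zip(s1, s2))


def read_sequences(input_fn=input):
    """Prompt until both sequences are valid DNA of the same length.

    input_fn is injectable (a choice made for this sketch) so the loop
    can be exercised without typing at a keyboard.
    """
    while True:
        # upper() lets the user type upper case, lower case, or a mix.
        s1 = input_fn("please enter string one: ").strip().upper()
        s2 = input_fn("please enter string two: ").strip().upper()
        if not (is_valid_dna(s1) and is_valid_dna(s2)):
            print("Sequences may only contain A, T, G and C. Try again.")
        elif len(s1) != len(s2):
            print("Sequences must have the same length. Try again.")
        else:
            return s1, s2


def format_report(s1, s2):
    """Return the four output lines in the format shown in the question."""
    return "\n".join([
        s1,
        match_line(s1, s2),
        s2,
        "The hamming distance of sequence {} and {} is {}".format(
            s1, s2, hamming_distance(s1, s2)),
    ])
```

To run it interactively, something like print(format_report(*read_sequences())) should be enough. The distance function is kept separate from the input loop so that it takes its arguments as given instead of overwriting them with input(), which was one of the problems in the original attempt.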

The wiki page on Hamming distance has elegant Python and C implementations for computing it. That implementation assumes that the Hamming distance is undefined for sequences of different lengths. However, there are two possible ways to compute a distance for strings of different lengths:

1) Align the two sequences and then compute the Hamming distance between the two gap-filled character arrays. This is closely related to what is formally called edit distance or Levenshtein distance.

2) Alternatively, one could use the zip_longest function from itertools. The following implementation is equivalent to appending gap characters to the end of the shorter string so that it matches the length of the longer string. [Note: compared to approach 1, the value returned by this method can over-estimate the distance, because it does not account for alignment.]

import itertools

def hammingDist(str1, str2, fillchar = '-'):
    return sum([ch1 != ch2 for (ch1,ch2) in itertools.zip_longest(str1, str2, fillvalue = fillchar)])


def main():
    # Running test cases:    
    print('Expected value \t Value returned')
    print(0,'\t', hammingDist('ABCD','ABCD'))
    print(1,'\t', hammingDist('ABCD','ABED'))
    print(2,'\t', hammingDist('ABCD','ABCDEF'))
    print(2,'\t', hammingDist('ABCDEF','ABCD'))
    print(4,'\t', hammingDist('ABCD',''))
    print(4,'\t', hammingDist('','ABCD'))
    print(1,'\t', hammingDist('ABCD','ABcD'))

if __name__ == "__main__":
    main()
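
To make the difference between the two options above concrete, here is a minimal sketch that puts an end-padded Hamming distance next to a plain Levenshtein (edit) distance. The names padded_hamming and edit_distance are hypothetical, and the edit distance uses the textbook row-by-row dynamic program with unit costs, not a biological alignment with a scoring matrix.

```python
import itertools


def padded_hamming(str1, str2, fillchar="-"):
    """Hamming distance after padding the shorter string at its end."""
    pairs = itertools.zip_longest(str1, str2, fillvalue=fillchar)
    return sum(ch1 != ch2 for ch1, ch2 in pairs)


def edit_distance(str1, str2):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    # previous[j] holds the distance between the prefix of str1
    # processed so far and str2[:j].
    previous = list(range(len(str2) + 1))
    for i, ch1 in enumerate(str1, start=1):
        current = [i]
        for j, ch2 in enumerate(str2, start=1):
            current.append(min(
                previous[j] + 1,                 # delete ch1
                current[j - 1] + 1,              # insert ch2
                previous[j - 1] + (ch1 != ch2),  # substitute or match
            ))
        previous = current
    return previous[-1]
```

For 'ABCD' and 'BCD', padding at the end misaligns every position, so padded_hamming gives 4, while edit_distance gives 1 (delete the leading 'A'). For equal-length strings such as 'GATTACA' and 'GACTATA', both give 2.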
