
iterating through a list in python

I have a list l of sequences (many thousands of them): l = ["ABCD", "AABA", ...]. I also have a file f containing four-letter sequences (around a million of them). For every sequence in f I want to find the closest string in l, and if that closest match is within a Hamming distance of at most 2, increment the counter good_count. I wrote the following code for this, but it is very slow. I was wondering if it can be done faster.

def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

f = open("input.txt","r")

l = [ABCD,AABA,...]

good_count = 0
for s in f:
     x = f.readline()
     dist_array = []
     for ll in l:
        dist = hamming(x,ll)
        dist_array.append(dist)
     min_dist = min(dist_array)
     if min_dist <= 2:
        good_count += 1
print good_count

It runs fast when l and f are small, but takes far too long when they are large. Please suggest a quicker solution.

Use an existing library, for instance jellyfish:

from jellyfish import hamming_distance

This gives you a C implementation of the Hamming distance.
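For instance (a minimal sketch, assuming l holds the question's list and the jellyfish package is installed, not code from the answer itself):

from jellyfish import hamming_distance

good_count = 0
with open("input.txt") as f:
    for line in f:
        line = line.strip()  # drop the trailing newline before comparing
        # hamming_distance() does the same work as the pure-Python hamming(),
        # just in C
        if min(hamming_distance(line, ll) for ll in l) <= 2:
            good_count += 1
print(good_count)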

Oh, you're just counting how MANY sequences have a match within a Hamming distance of at most 2? That can be done much more quickly.

total_count = 0

for line in f:
    # no x = f.readline() here -- `line` already holds the current line
    line = line.strip()  # just in case
    for ll in l:
        if hamming(line, ll) <= 2:
            total_count += 1
            break  # skip the rest of the ll in l loop
    # and then you don't need any processing afterwards either.

Note that most of your running time will be spent on this line:

        if hamming(line, ll) <= 2:

So any way you can improve that algorithm will GREATLY improve your overall script speed. Boud's answer extols the virtues of jellyfish's hamming_distance function; I can't vouch for it from personal experience, but his advice to use a faster implementation of the Hamming distance is sound!
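For instance, since you only care whether the distance is at most 2, a pure-Python version can bail out early instead of counting every mismatch (a minimal sketch; the name within_hamming_limit is mine, not from either answer):

def within_hamming_limit(s1, s2, limit=2):
    # count mismatches, but give up as soon as the limit is exceeded
    mismatches = 0
    for ch1, ch2 in zip(s1, s2):
        if ch1 != ch2:
            mismatches += 1
            if mismatches > limit:
                return False
    return True

Calling within_hamming_limit(line, ll) in the loop above then replaces the hamming(line, ll) <= 2 check, and saves work on every badly-mismatched pair.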


Peter DeGlopper suggests blowing the l list up into six different sets of "two-or-less Hamming distance" matches. The key observation: two 4-letter strings are within Hamming distance 2 exactly when they agree on at least 2 of the 4 positions, i.e. on at least one of the six possible position pairs. So build one set per position pair, each containing the projection of every string in l onto that pair. This might look like:

# hamming_sets is [ {AB??}, {A?C?}, {A??D}, {?BC?}, {?B?D}, {??CD} ]
hamming_sets = [ set(), set(), set(), set(), set(), set() ]

for ll in l:
    # this should take the lion's share of the time in your program
    hamming_sets[0].add(ll[0] + ll[1])
    hamming_sets[1].add(ll[0] + ll[2])
    hamming_sets[2].add(ll[0] + ll[3])
    hamming_sets[3].add(ll[1] + ll[2])
    hamming_sets[4].add(ll[1] + ll[3])
    hamming_sets[5].add(ll[2] + ll[3])

total_count = 0

for line in f:
    # and this should be fast, even if `f` is large
    line = line.strip()
    if line[0]+line[1] in hamming_sets[0] or \
       line[0]+line[2] in hamming_sets[1] or \
       line[0]+line[3] in hamming_sets[2] or \
       line[1]+line[2] in hamming_sets[3] or \
       line[1]+line[3] in hamming_sets[4] or \
       line[2]+line[3] in hamming_sets[5]:
        total_count += 1

You could possibly gain readability by making hamming_sets a dictionary of transform_function: set_of_results key-value pairs.

hamming_sets = {(lambda s: s[0] + s[1]): set(),
                (lambda s: s[0] + s[2]): set(),
                (lambda s: s[0] + s[3]): set(),
                (lambda s: s[1] + s[2]): set(),
                (lambda s: s[1] + s[3]): set(),
                (lambda s: s[2] + s[3]): set()}

for func, set_ in hamming_sets.items():
    for ll in l:
        set_.add(func(ll))

total_count = 0

for line in f:
    line = line.strip()
    if any(func(line) in set_ for func, set_ in hamming_sets.items()):
        total_count += 1
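If you want to convince yourself that the set trick agrees with the brute-force scan, here is a self-contained sanity check (made-up sequences, not the real l or input.txt):

from itertools import combinations

def hamming(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

l = ["ABCD", "AABA", "XYZW"]              # stand-in for the real list
lines = ["ABCE", "QQQQ", "AABA", "XYAB"]  # stand-in for the file contents

pairs = list(combinations(range(4), 2))   # (0,1), (0,2), ..., (2,3)
hamming_sets = {p: set() for p in pairs}
for ll in l:
    for i, j in pairs:
        hamming_sets[i, j].add(ll[i] + ll[j])

fast = sum(1 for line in lines
           if any(line[i] + line[j] in hamming_sets[i, j] for i, j in pairs))
brute = sum(1 for line in lines
            if min(hamming(line, ll) for ll in l) <= 2)
assert fast == brute
print(fast)  # 3: "ABCE", "AABA" and "XYAB" are each within distance 2 of l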
