Use Levenshtein distance for keys in a defaultdict in Python

I am doing some sequencing analysis, and I'm trying to create a default dictionary of genetic sequences keyed by their identifiers. In the following example, I have created a dict and put both sequences, AGAGAG and ATATAT, in the same list because they share the identifier CCCCCC:




from collections import defaultdict
d = defaultdict(list)

The problem I have is that if a key sequence is within a Levenshtein distance of 1 of an existing key, I want it to be treated as the same key. So if I come across a sequence that looks like this:


I want to look through the dict, see that there is a key CCCCCC and that distance('CCCCCC', 'CCCCCT') < 2, and then change CCCCCT to CCCCCC and append the sequence to the same list as above.
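
To make the intended behaviour concrete, a naive version might look like the sketch below. It assumes every read is a 6-character identifier followed by the sequence; `levenshtein` and `add_read` are hypothetical helper names (not from any library), and the read `CCCCCTACACAC` is made up for illustration. It compares each new key against every existing key, and the first key within distance 1 wins.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def add_read(d, read, key_len=6, max_dist=1):
    """Append the read's sequence under an existing close key, else under its own key."""
    key, seq = read[:key_len], read[key_len:]
    for existing in d:
        if levenshtein(existing, key) <= max_dist:
            key = existing  # reuse the first key that is close enough
            break
    d[key].append(seq)


d = defaultdict(list)
for read in ["CCCCCCAGAGAG", "CCCCCCATATAT", "CCCCCTACACAC"]:  # made-up reads
    add_read(d, read)
print(dict(d))  # {'CCCCCC': ['AGAGAG', 'ATATAT', 'ACACAC']}
```

The cost is one distance computation per existing key for every read, which is what becomes slow on large inputs.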

Hopefully there is a good way of doing this. Thanks.

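import numpy

# 100,000 random 6-letter uppercase keys to use as a large test input.
# randint's upper bound is exclusive, so 91 is needed to include 'Z' (chr(90)).
biginput = [''.join(chr(y) for y in numpy.random.randint(65, 91, 6))
            for x in range(100000)]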

I think you would have to create about 6 sortings of the keys, so that each key needs only a couple of comparisons instead of a comparison against every other key. This should be possible, since at Levenshtein distance 1 only a few variations of each key need to be considered.

In fact, you'll need some form of LSH (locality-sensitive hashing). Perhaps someone can help further.
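
One possible way to get down to a handful of lookups per key, sketched here under the assumption that all keys have the same length (6 in the question): two equal-length strings are within Levenshtein distance 1 exactly when they differ in at most one position, because a single insertion or deletion would change the length. Masking each position in turn gives 6 patterns per key, and two keys within distance 1 always share at least one pattern. The names `patterns` and `group_reads`, the `*` wildcard (assumed not to occur in the keys) and the sample reads are all hypothetical. This is a simple wildcard index, not a full LSH scheme.

```python
from collections import defaultdict


def patterns(key):
    """All copies of key with exactly one position replaced by a wildcard."""
    return [key[:i] + "*" + key[i + 1:] for i in range(len(key))]


def group_reads(reads, key_len=6):
    groups = defaultdict(list)  # canonical key -> list of sequences
    canonical = {}              # masked pattern -> canonical key that registered it
    for read in reads:
        key, seq = read[:key_len], read[key_len:]
        # Use the canonical key of the first known pattern, else the key itself.
        target = next((canonical[p] for p in patterns(key) if p in canonical), key)
        if target == key:
            # Only canonical keys register their patterns.
            for p in patterns(key):
                canonical.setdefault(p, key)
        groups[target].append(seq)
    return groups


reads = ["CCCCCCAGAGAG", "CCCCCCATATAT", "CCCCCTACACAC", "AAAAAAACDCBA"]  # made up
groups = group_reads(reads)
print(dict(groups))
# {'CCCCCC': ['AGAGAG', 'ATATAT', 'ACACAC'], 'AAAAAA': ['ACDCBA']}
```

Each read costs `key_len` dictionary lookups regardless of how many keys already exist. As with any greedy grouping, the first key seen becomes the canonical one, so the result can depend on input order.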

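You can use difflib.SequenceMatcher: its ratio() method returns 1.0 for identical sequences and a smaller value the more they differ, so you can compare the ratio against a threshold instead of computing a distance.

In this case:

>>> import difflib
>>> difflib.SequenceMatcher(None, 'CCCCCC', 'CCCCCT').ratio()
0.8333333333333334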
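
For reference, `ratio()` is documented as `2.0*M/T`, where `M` is the number of matched characters and `T` is the combined length of both sequences. For two 6-character keys, one substitution leaves `M = 5`, giving 10/12 ≈ 0.83, while two substitutions give 8/12 ≈ 0.67, which is why a threshold of 0.8 separates the two cases here. The snippet below only checks that arithmetic; `similarity` is a hypothetical wrapper name.

```python
import difflib


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the sequences are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


one_sub = similarity("CCCCCC", "CCCCCT")   # 5 matched characters out of 6
two_subs = similarity("CCCCCC", "CCCCTT")  # 4 matched characters out of 6
print(one_sub > 0.8, two_subs > 0.8)       # True False
```

The ratio is a similarity score rather than an edit distance, so the threshold would have to be worked out again for keys of a different length.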

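Demo:

>>> from collections import defaultdict
>>> from itertools import combinations
>>> import difflib
>>> # li is the list of reads: a 6-character identifier followed by the sequence
>>> d = defaultdict(list)
>>> for i in li:
...     d[i[:6]].append(i[6:])
>>> keys = list(d.keys())  # snapshot of the keys, since d is modified below
>>> for i, j in combinations(keys, 2):
...     # skip keys that were already merged away; indexing them would recreate empty entries
...     if i in d and j in d and difflib.SequenceMatcher(None, i, j).ratio() > 0.8:
...         d[i].extend(d[j])
...         del d[j]
>>> d
defaultdict(<type 'list'>, {'AAAAAA': ['ACDCBA', 'ACACAC'], 'CCCCCC': ['ATATAT', 'AGAGAG', 'ACACAC']})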

