Use Levenshtein distance for keys in defaultdict in Python

Question

I am doing some sequencing analysis, and I'm trying to build a defaultdict of genetic sequences grouped by an identifier. In the following example, I have created a dict and put both sequences AGAGAG and ATATAT in the same list because they share the identifier CCCCCC:

input:

CCCCCCAGAGAG
CCCCCCATATAT

code:

from collections import defaultdict
d = defaultdict(list)
d['CCCCCC'].append('AGAGAG')
d['CCCCCC'].append('ATATAT')

The problem I have is that if a key sequence is within a Levenshtein distance of 1 of an existing key, I want it to be treated as the same key. So if I come across a sequence that looks like this:

CCCCCTACACAC

I want to look through the dict, find that there is a key CCCCCC, see that distance('CCCCCC', 'CCCCCT') < 2, and then change CCCCCT to CCCCCC and append ACACAC to the same list as above.

Hopefully there is a good way of doing this. Thanks.

Answer 1

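import numpy
# Build 100,000 random 6-letter uppercase keys as a test input.
# randint's upper bound is exclusive, so 91 is needed to include 'Z'.
biginput = [''.join(chr(y) for y in numpy.random.randint(65, 91, 6))
            for x in range(100000)]
print(biginput[0])  # a random key, e.g. 'VSNRGF'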

With an input of this size, comparing every pair of keys gets expensive. I'm thinking you have to somehow create about 6 sortings of the keys (one per key position, say), so that each key only needs to be compared against a couple of candidates. This should be possible, since a Levenshtein distance of 1 only allows a small number of variations of each key.

In fact, you'll probably need some form of LSH (locality-sensitive hashing). Perhaps someone can help further.
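A minimal sketch of how the "few comparisons per key" idea could look, assuming every key has the same fixed length (6 here). Under that assumption a Levenshtein distance of 1 can only be a single substitution, so two such keys become identical once the differing position is removed. The names `masked_forms` and `group_by_near_key` are made up for this illustration, and keeping the first key seen as the canonical one is an arbitrary choice.

```python
from collections import defaultdict

def masked_forms(key):
    """Yield (position, key with that position removed) for each position."""
    for i in range(len(key)):
        yield i, key[:i] + key[i + 1:]

def group_by_near_key(reads, key_len=6):
    """Group reads by their key, merging keys that differ in one position."""
    groups = defaultdict(list)  # canonical key -> list of sequences
    index = {}                  # (position, masked key) -> canonical key
    for read in reads:
        key, seq = read[:key_len], read[key_len:]
        # At most key_len dictionary lookups per read, instead of a
        # comparison against every key seen so far.
        canonical = next(
            (index[m] for m in masked_forms(key) if m in index), None)
        if canonical is None:
            # No known key within one substitution: register a new one.
            canonical = key
            for m in masked_forms(key):
                index[m] = canonical
        groups[canonical].append(seq)
    return dict(groups)

reads = ['CCCCCCAGAGAG', 'CCCCCCATATAT', 'CCCCCTACACAC', 'AAAAAAACACAC']
result = group_by_near_key(reads)
```

Here `result` maps 'CCCCCC' to ['AGAGAG', 'ATATAT', 'ACACAC'] and 'AAAAAA' to ['ACACAC']. Only canonical keys are indexed, so a key is never merged into a group through a chain of near matches. If keys can have different lengths, insertions and deletions also count and this shortcut no longer covers every case.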

Answer 2

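You can use difflib.SequenceMatcher, whose ratio() method returns 1.0 for identical sequences and lower values the more they differ, and compare that ratio against a threshold. Note that this is a similarity score, not the Levenshtein distance itself.

In this case: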

>>> import difflib
>>> difflib.SequenceMatcher(None,'CCCCCC', 'CCCCCT').ratio()
0.8333333333333334
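To see where that number comes from: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below works through a few 6-character keys; the helper name `ratio` and the example strings other than the CCCCCC/CCCCCT pair are only for illustration.

```python
import difflib

def ratio(a, b):
    """Similarity score from difflib, 2.0*M/T as described in its docs."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# One substitution: 5 of 6 characters match, so 2*5/12.
one_sub = ratio('CCCCCC', 'CCCCCT')

# Two substitutions: 4 of 6 characters match, so 2*4/12.
two_sub = ratio('CCCCCC', 'CCCCTT')

# A rotated key: the block 'BCDEF' matches (M = 5), so the ratio equals
# the one-substitution case even though the Levenshtein distance is 2
# (delete the leading 'A', insert an 'A' at the end).
shifted = ratio('ABCDEF', 'BCDEFA')
```

So for 6-character keys a threshold of 0.8 separates one substitution from two, but the rotated example shows that a ratio above 0.8 does not always mean a Levenshtein distance of 1.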

Demo:

>>> from collections import defaultdict
>>> from itertools import combinations
>>> import difflib

>>> li = ['AAAAAAACDCBA', 'CCCCCCATATAT', 'CCCCCCAGAGAG', 'CCCCCTACACAC', 'AAAAAAACACAC']
>>> d = defaultdict(list)
>>> for i in li:
...     d[i[:6]].append(i[6:])  # the first 6 characters are the identifier
...
>>> keys = list(d)  # snapshot of the keys, since some are deleted below
>>> for i, j in combinations(keys, 2):
...     # skip keys that were already merged; d[i] would silently recreate them
...     if i in d and j in d and difflib.SequenceMatcher(None, i, j).ratio() > 0.8:
...         d[i].extend(d[j])
...         del d[j]
...
>>> d
defaultdict(<class 'list'>, {'AAAAAA': ['ACDCBA', 'ACACAC'], 'CCCCCC': ['ATATAT', 'AGAGAG', 'ACACAC']})
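If the actual Levenshtein distance is wanted rather than the difflib ratio, a rough sketch could look like the following. It uses a plain dynamic-programming implementation and a linear scan over the existing keys, which is simple but does one distance computation per existing key for every new key. The function names `levenshtein` and `add_read` and the parameters `key_len` and `max_dist` are assumptions made for this example, not part of any library.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def add_read(groups, read, key_len=6, max_dist=1):
    """Append the read's sequence under an existing key within max_dist,
    or under its own key if no existing key is close enough."""
    key, seq = read[:key_len], read[key_len:]
    if key not in groups:
        for existing in groups:
            if levenshtein(existing, key) <= max_dist:
                key = existing
                break
    groups[key].append(seq)

groups = defaultdict(list)
for read in ['CCCCCCAGAGAG', 'CCCCCCATATAT', 'CCCCCTACACAC']:
    add_read(groups, read)
```

After this loop, `groups['CCCCCC']` holds ['AGAGAG', 'ATATAT', 'ACACAC']. Which key ends up as the representative depends on the order in which the reads arrive.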

2 answers

solution1
2 2015-08-19 17:06:16

solution2
1 ACCEPTED 2015-08-19 16:38:28
