I am doing some sequencing analysis, and I'm trying to create a default dictionary of genetic sequences based on some identifiers. In the following example, I have created a dict and put both sequences, AGAGAG and ATATAT, in the same list because they share the identifier CCCCCC:
input:
CCCCCCAGAGAG
CCCCCCATATAT
code:
from collections import defaultdict
d = defaultdict(list)
d['CCCCCC'].append('AGAGAG')
d['CCCCCC'].append('ATATAT')
The problem I have is that if a key sequence is within a Levenshtein distance of 1 of an existing key, I want it to be treated as the same key. So if I come across a sequence that looks like this:
CCCCCTACACAC
I want to look through the dict, see that there is a CCCCCC key and that distance('CCCCCC', 'CCCCCT') < 2, and so maybe change CCCCCT to CCCCCC and then append to the same list as above.
Hopefully there is a good way of doing this. Thanks.
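To make the desired behaviour concrete, here is a minimal sketch of it using a plain linear scan over the existing keys. The helper names within_one_edit and add_read are made up for illustration, and the 6-character key length is an assumption taken from the example input. Note that the first matching key wins, so the grouping can depend on the order in which reads arrive.

```python
from collections import defaultdict


def within_one_edit(a, b):
    """Return True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after the mismatch must agree.
        return a[i + 1:] == b[i + 1:]
    # One insertion into a: the rest of a must match b after the extra character.
    return a[i:] == b[i + 1:]


def add_read(d, read, key_len=6):
    """Append the sequence part of `read` under a key within one edit, else under a new key."""
    key, payload = read[:key_len], read[key_len:]
    for existing in d:
        if within_one_edit(existing, key):
            key = existing  # reuse the key that is already in the dict
            break
    d[key].append(payload)


d = defaultdict(list)
for read in ['CCCCCCAGAGAG', 'CCCCCCATATAT', 'CCCCCTACACAC']:
    add_read(d, read)
print(dict(d))  # {'CCCCCC': ['AGAGAG', 'ATATAT', 'ACACAC']}
```

Each insertion compares against every existing key, so this is only practical while the number of distinct keys stays small.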
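import numpy
# Build 100,000 random 6-letter keys as test input (randint's upper bound is exclusive, so letters run A-Y).
biginput = [''.join([chr(y) for y in numpy.random.randint(65, 90, 6)])
            for x in range(100000)]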
biginput[0]
'VSNRGF'
I'm thinking you have to somehow create ~6 sortings, so that for each key you only have to make a couple of comparisons. This is possible because, at a Levenshtein distance of 1, only a handful of variations of each key need to be considered.
In fact, you'll need some form of LSH (locality-sensitive hashing). Perhaps someone can help further.
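One possible way to get there with exact-match lookups only is sketched below. It is not a full LSH scheme: it indexes every key under itself plus each string obtained by deleting one character. Any two keys within one edit of each other share at least one such variant, so a lookup only has to check about 7 buckets for a 6-letter key. A shared variant is necessary but not sufficient, so candidates are verified. The names FuzzyKeyIndex, deletion_variants and within_one_edit are hypothetical.

```python
from collections import defaultdict


def deletion_variants(key):
    """The key itself plus every string made by deleting exactly one character."""
    return {key} | {key[:i] + key[i + 1:] for i in range(len(key))}


def within_one_edit(a, b):
    """Return True if the Levenshtein distance between a and b is at most 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # identical, or exactly one substitution
    return a[i:] == b[i + 1:]  # exactly one inserted character in b


class FuzzyKeyIndex:
    """Maps each deletion variant to the canonical keys that produce it."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def find(self, key):
        """Return a stored key within one edit of `key`, or None."""
        for variant in deletion_variants(key):
            for candidate in self.buckets.get(variant, ()):
                if within_one_edit(candidate, key):  # verify the candidate
                    return candidate
        return None

    def add(self, key):
        for variant in deletion_variants(key):
            self.buckets[variant].add(key)


index = FuzzyKeyIndex()
groups = defaultdict(list)
for read in ['CCCCCCAGAGAG', 'CCCCCCATATAT', 'CCCCCTACACAC', 'AAAAAAACACAC']:
    key, payload = read[:6], read[6:]  # assumes a 6-letter key prefix
    canonical = index.find(key)
    if canonical is None:
        index.add(key)
        canonical = key
    groups[canonical].append(payload)
print(dict(groups))
```

If several stored keys are within one edit of a new key, which one is returned is arbitrary here, so a real pipeline would need a tie-breaking rule.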
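You can use difflib.SequenceMatcher. Its ratio() method returns 1.0 for identical sequences and a lower value the more they differ, so you can compare it against a threshold. Note that this is a similarity ratio rather than a Levenshtein distance.
In this case: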
>>> import difflib
>>> difflib.SequenceMatcher(None,'CCCCCC', 'CCCCCT').ratio()
0.8333333333333334
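To see where that number comes from: ratio() is documented as 2*M/T, where M is the number of matched characters and T is the combined length of both sequences. The short sketch below (similarity is just a hypothetical wrapper) prints the ratio for a few variants of the key, which suggests why a cutoff of 0.8 admits a single substitution in a 6-letter key but not two.

```python
import difflib


def similarity(a, b):
    """Wrapper around SequenceMatcher.ratio(), i.e. 2*M/T."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# identical, one substitution, two substitutions, one deletion
for other in ['CCCCCC', 'CCCCCT', 'CCCCTT', 'CCCCC']:
    print(other, round(similarity('CCCCCC', other), 3))
# CCCCCC 1.0
# CCCCCT 0.833
# CCCCTT 0.667
# CCCCC 0.909
```

The threshold only stands in for an edit distance here, so it would need re-tuning for keys of a different length.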
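Demo:
>>> from collections import defaultdict
>>> from itertools import combinations
>>> import difflib
>>> li=['AAAAAAACDCBA', 'CCCCCCATATAT', 'CCCCCCAGAGAG', 'CCCCCTACACAC', 'AAAAAAACACAC']
>>> d = defaultdict(list)
>>> for i in li:
...     d[i[:6]].append(i[6:])  # first 6 characters are the key, the rest is the sequence
...
>>> keys = list(d.keys())  # snapshot, since keys are deleted while merging
>>> for i, j in combinations(keys, 2):
...     # skip keys already merged away; merge j into i when they are similar enough
...     if i in d and j in d and difflib.SequenceMatcher(None, i, j).ratio() > 0.8:
...         d[i].extend(d[j])
...         del d[j]
...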
>>> d
defaultdict(<type 'list'>, {'AAAAAA': ['ACDCBA', 'ACACAC'], 'CCCCCC': ['ATATAT', 'AGAGAG', 'ACACAC']})
>>>