简体   繁体   中英

DNA subsequence dynamic programming question

I'm trying to solve DNA problem which is more of improved(?) version of LCS problem. In the problem, there is string which is string and semi-substring which allows part of string to have one or no letter skipped. For example, for string "desktop", it has semi-substring {"destop", "dek", "stop", "skop","desk","top"} , all of which has one or no letter skipped.

Now, I am given two DNA strings consisting of {a,t,g,c} . I"m trying to find longest semi-substring, LSS. and if there is more than one LSS, print out the one in the fastest order.

For example, two dnas {attgcgtagcaatg, tctcaggtcgatagtgac} prints out "tctagcaatg"

and aaaattttcccc, cccgggggaatatca prints out "aattc"

I'm trying to use common LCS algorithm but cannot solve it with tables although I did solve the one with no letter skipped. Any advice?

Let g(c, rs, rt) represent the longest common semi-substring of strings, S and T , ending at rs and rt , where rs and rt are the ranked occurences of the character, c , in S and T , respectively, and K is the number of skips allowed. Then we can form a recursion which we would be obliged to perform on all pairs of c in S and T.

JavaScript code:

 function f(S, T, K){ // mapS maps a char to indexes of its occurrences in S // rsS maps the index in S to that char's rank (index) in mapS const [mapS, rsS] = mapString(S) const [mapT, rsT] = mapString(T) // h is used to memoize g const h = {} function g(c, rs, rt){ if (rs < 0 || rt < 0) return 0 if (h.hasOwnProperty([c, rs, rt])) return h[[c, rs, rt]] // (We are guaranteed to be on // a match in this state.) let best = [1, c] let idxS = mapS[c][rs] let idxT = mapT[c][rt] if (idxS == 0 || idxT == 0) return best for (let i=idxS-1; i>=Math.max(0, idxS - 1 - K); i--){ for (let j=idxT-1; j>=Math.max(0, idxT - 1 - K); j--){ if (S[i] == T[j]){ const [len, str] = g(S[i], rsS[i], rsT[j]) if (len + 1 >= best[0]) best = [len + 1, str + c] } } } return h[[c, rs, rt]] = best } let best = [0, ''] for (let c of Object.keys(mapS)){ for (let i=0; i<(mapS[c]||[]).length; i++){ for (let j=0; j<(mapT[c]||[]).length; j++){ let [len, str] = g(c, i, j) if (len > best[0]) best = [len, str] } } } return best } function mapString(s){ let map = {} let rs = [] for (let i=0; i<s.length; i++){ if (!map[s[i]]){ map[s[i]] = [i] rs.push(0) } else { map[s[i]].push(i) rs.push(map[s[i]].length - 1) } } return [map, rs] } console.log(f('attgcgtagcaatg', 'tctcaggtcgatagtgac', 1)) console.log(f('aaaattttcccc', 'cccgggggaatatca', 1)) console.log(f('abcade', 'axe', 1))

This is a variation on the dynamic programming solution for LCS, written in Python.

First I'm building up a Suffix Tree for all the substrings that can be made from each string with the skip rule. Then I'm intersecting the suffix trees. Then I'm looking for the longest string that can be made from that intersection tree.

Please note that this is technically O(n^2) . Its worst case is when both strings are the same character, repeated over and over again. Because you wind up with a lot of what logically is something like, "an 'l' at position 42 in the one string could have matched against position l at position 54 in the other". But in practice it will be O(n) .

def find_subtree (text, max_skip=1):
    tree = {}
    tree_at_position = {}

    def subtree_from_position (position):
        if position not in tree_at_position:
            this_tree = {}
            if position < len(text):
                char = text[position]
                # Make sure that we've populated the further tree.
                subtree_from_position(position + 1)

                # If this char appeared later, include those possible matches.
                if char in tree:
                    for char2, subtree in tree[char].iteritems():
                        this_tree[char2] = subtree

                # And now update the new choices.
                for skip in range(max_skip + 1, 0, -1):
                    if position + skip < len(text):
                        this_tree[text[position + skip]] = subtree_from_position(position + skip)

                tree[char] = this_tree

            tree_at_position[position] = this_tree

        return tree_at_position[position]

    subtree_from_position(0)

    return tree


def find_longest_common_semistring (text1, text2):
    tree1 = find_subtree(text1)
    tree2 = find_subtree(text2)

    answered = {}
    def find_intersection (subtree1, subtree2):
        unique = (id(subtree1), id(subtree2))
        if unique not in answered:
            answer = {}
            for k, v in subtree1.iteritems():
                if k in subtree2:
                    answer[k] = find_intersection(v, subtree2[k])
            answered[unique] = answer
        return answered[unique]


    found_longest = {}
    def find_longest (tree):
        if id(tree) not in found_longest:
            best_candidate = ''
            for char, subtree in tree.iteritems():
                candidate = char + find_longest(subtree)
                if len(best_candidate) < len(candidate):
                    best_candidate = candidate
            found_longest[id(tree)] = best_candidate
        return found_longest[id(tree)]

    intersection_tree = find_intersection(tree1, tree2)
    return find_longest(intersection_tree)


print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM