
Finding the best match of a substring in a string

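I am trying to find a substring within a larger string. I can already do this for exact matches, but I would like to be able to search the larger string for the best match of the substring.

For example: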

seq1: ATGCTGCTA
seq2: CAGTCATGCATGCATCGATCAGTCAGCAATGCTGCTACGAGACGGTGGCCTAGAGTCGCATGCA
seq3: ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCAC

Using BioPython's pairwise2, I have a short script that aligns the first sequence against the second, and the first sequence against the third.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment



seq1= "TCCCAGGTAACAAACCAACCAACTTTCG"
seq2= "CAGTCATGCATGCATCGATCAGTCAGCAATGCTGCTACGAGACGGTGGCCTAGAGTCGCATGCA"
seq3= "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCAC"

for a in pairwise2.align.localms(seq1, seq2, 2, -1, -0.5, -0.1):
    print(format_alignment(*a))
    
for a in pairwise2.align.localms(seq1, seq3, 2, -1, -0.5, -0.1):
    print(format_alignment(*a))

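In short, this code aligns seq1 against seq2 (or seq3): each matching position adds 2 to the score, each mismatch subtracts 1, and opening a gap subtracts 0.5 (each gap extension subtracts 0.1). The highest possible score is twice the length of seq1.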
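The scoring arithmetic can be checked by hand for the gap-free case. The sketch below is only an illustration: `ungapped_score` is a hypothetical helper, not part of pairwise2, and it ignores gaps entirely.

```python
def ungapped_score(a, b, match=2, mismatch=-1):
    """Score two equal-length strings position by position (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


query = "TCCCAGGTAACAAACCAACCAACTTTCG"

# A perfect match scores 2 per position, i.e. twice the query length.
print(ungapped_score(query, query))

# One substitution turns a +2 into a -1, so the score drops by 3.
one_off = "A" + query[1:]
print(ungapped_score(query, one_off))
```

This is why a score close to `2 * len(seq1)` indicates a near-perfect hit.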

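I have two problems. First, I only need the score returned; at the moment the code returns an alignment together with a score, like this: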

 1 GTGAAATGGTCATGTGTGGCAGTTCACTATATG
   ||||||||||||||||||||.||||||||||||
15431 GTGAAATGGTCATGTGTGGCGGTTCACTATATG
  Score=63
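For the score-only requirement, pairwise2's alignment functions document a `score_only` keyword argument that returns just the best score; check that your Biopython version supports it. Independently of the library, the score can be computed without any traceback. The following is a minimal pure-Python sketch of a local alignment score with affine gaps; `local_score` is a hypothetical name, and the gap convention (a gap of length k costs `gap_open + (k - 1) * gap_extend`) is an assumption of this sketch.

```python
def local_score(a, b, match=2.0, mismatch=-1.0, gap_open=-0.5, gap_extend=-0.1):
    """Best local alignment score of a against b (score only, no traceback).

    Assumed gap convention: a gap of length k costs
    gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    prev_h = [0.0] * (len(b) + 1)       # best scores of the previous row
    prev_f = [neg_inf] * (len(b) + 1)   # previous row, ending in a vertical gap
    best = 0.0
    for ch_a in a:
        cur_h = [0.0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf                     # current row, ending in a horizontal gap
        for j, ch_b in enumerate(b, start=1):
            e = max(cur_h[j - 1] + gap_open, e + gap_extend)
            cur_f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_extend)
            diag = prev_h[j - 1] + (match if ch_a == ch_b else mismatch)
            cur_h[j] = max(0.0, diag, e, cur_f[j])
            if cur_h[j] > best:
                best = cur_h[j]
        prev_h, prev_f = cur_h, cur_f
    return best


top = "GTGAAATGGTCATGTGTGGCAGTTCACTATATG"
bottom = "GTGAAATGGTCATGTGTGGCGGTTCACTATATG"
print(local_score(top, bottom))  # 32 matches and 1 mismatch: 2 * 32 - 1
```

Only two rows are kept in memory, so the memory cost grows with the length of `b` rather than with the full matrix. The running time is still proportional to `len(a) * len(b)`, so this alone does not address the speed problem.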

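The second problem is speed. Aligning two short sequences against each other is very fast, but aligning a short sequence (~25 letters) against a long sequence of ~30,000 letters takes about 15 seconds. That will be a problem, because I have to align about 50,000 short sequences against the long sequence. Can anyone suggest how to speed this up?

Many thanks for your help.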
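One common way to avoid scanning the whole long sequence for every query is a seed-and-filter heuristic: index the long sequence once by its k-mers, then score only the placements that share an exact k-mer with the query. The sketch below is an assumption-laden illustration, not pairwise2 functionality: `build_kmer_index` and `best_ungapped_hit` are hypothetical names, the scoring is ungapped, and any hit that shares no exact k-mer with the query is missed.

```python
from collections import defaultdict


def build_kmer_index(target, k):
    """Map every k-mer in target to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def best_ungapped_hit(query, target, index, k, match=2, mismatch=-1):
    """Return (score, start) of the best seeded ungapped placement, or None."""
    # Each shared k-mer proposes the target offset where the query would start.
    starts = set()
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            start = t_pos - q_pos
            if 0 <= start <= len(target) - len(query):
                starts.add(start)

    best = None
    for start in sorted(starts):
        window = target[start:start + len(query)]
        score = sum(match if a == b else mismatch for a, b in zip(query, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


target = "CAGTCATGCATGCATCGATCAGTCAGCAATGCTGCTACGAGACGGTGGCCTAGAGTCGCATGCA"
index = build_kmer_index(target, k=4)   # built once, reused for every query
print(best_ungapped_hit("ATGCTGCTA", target, index, k=4))
print(best_ungapped_hit("ATGCAGCTA", target, index, k=4))  # one substitution
```

The index is built once per long sequence, so the per-query cost depends on the number of seed hits rather than on the full length of the target. The reported start positions could then be handed to a full local alignment restricted to a small window if gaps matter.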

The solution you are looking for is in this post: Sequence Matcher

from difflib import SequenceMatcher

def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
    """Return best matching substring of corpus.

    Parameters
    ----------
    query : str
    corpus : str
    step : int
        Step size of first match-value scan through corpus. Can be thought of
        as a sort of "scan resolution". Should not exceed length of query.
    flex : int
        Max. left/right substring position adjustment value. Should not
        exceed length of query / 2.

    Outputs
    -------
    output0 : str
        Best matching substring.
    output1 : float
        Match ratio of best matching substring. 1 is perfect match.
    """

    def _match(a, b):
        """Compact alias for SequenceMatcher."""
        return SequenceMatcher(None, a, b).ratio()

    def scan_corpus(step):
        """Return list of match values from corpus-wide scan."""
        match_values = []

        m = 0
        while m + qlen - step <= len(corpus):
            # Compare against a full-length window (the original slice was one character short).
            match_values.append(_match(query, corpus[m: m + qlen]))
            if verbose:
                print(query, "-", corpus[m: m + qlen], _match(query, corpus[m: m + qlen]))
            m += step

        return match_values

    def index_max(v):
        """Return index of max value."""
        return max(range(len(v)), key=v.__getitem__)

    def adjust_left_right_positions():
        """Return left/right positions for best string match."""
        # bp_* is synonym for 'Best Position Left/Right' and are adjusted 
        # to optimize bmv_*
        p_l, bp_l = [pos] * 2
        p_r, bp_r = [pos + qlen] * 2

        # bmv_* are declared here in case they are untouched in optimization
        bmv_l = match_values[p_l // step]
        bmv_r = match_values[p_l // step]

        for f in range(flex):
            ll = _match(query, corpus[p_l - f: p_r])
            if ll > bmv_l:
                bmv_l = ll
                bp_l = p_l - f

            lr = _match(query, corpus[p_l + f: p_r])
            if lr > bmv_l:
                bmv_l = lr
                bp_l = p_l + f

            rl = _match(query, corpus[p_l: p_r - f])
            if rl > bmv_r:
                bmv_r = rl
                bp_r = p_r - f

            rr = _match(query, corpus[p_l: p_r + f])
            if rr > bmv_r:
                bmv_r = rr
                bp_r = p_r + f

            if verbose:
                print("\n" + str(f))
                print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f: p_r]))
                print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f: p_r]))
                print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r - f]))
                print("rr: -- value: %f -- snippet: %s" % (rr, corpus[p_l: p_r + f]))

        return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])

    if not case_sensitive:
        query = query.lower()
        corpus = corpus.lower()

    qlen = len(query)

    if flex >= qlen/2:
        print("Warning: flex exceeds length of query / 2. Setting to default.")
        flex = 3

    match_values = scan_corpus(step)
    pos = index_max(match_values) * step

    pos_left, pos_right, match_value = adjust_left_right_positions()

    return corpus[pos_left: pos_right].strip(), match_value
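The core idea of the function above, scanning windows of the corpus and rating each one with `SequenceMatcher.ratio()`, can be shown in a much smaller form. This is a simplified sketch, not the answer's algorithm: `best_window` is a hypothetical name, it uses a step of 1, and it does no left/right boundary adjustment.

```python
from difflib import SequenceMatcher


def best_window(query, corpus):
    """Return (ratio, start, substring) for the best fixed-length window."""
    qlen = len(query)
    best = (0.0, 0, corpus[:qlen])
    # Slide a window of the query's length over the corpus, one step at a time.
    for start in range(max(1, len(corpus) - qlen + 1)):
        window = corpus[start:start + qlen]
        ratio = SequenceMatcher(None, query, window).ratio()
        if ratio > best[0]:
            best = (ratio, start, window)
    return best


corpus = "CAGTCATGCATGCATCGATCAGTCAGCAATGCTGCTACGAGACGGTGGCCTAGAGTCGCATGCA"
ratio, start, hit = best_window("ATGCTGCTA", corpus)
print(ratio, start, hit)
```

A ratio of 1.0 means the window equals the query. Creating a `SequenceMatcher` for every window is expensive, so this form is better suited to experimenting with the scoring than to 50,000 queries against a 30,000-letter sequence.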

I have provided a link to an implementation called Bio-Jupiter, which lets you experiment with this problem directly in your browser.

https://github.com/gagniuc/Jupiter-Bioinformatics-V2-normal

Live: Bio-Jupiter

If the implementation matters to you, here is a JavaScript implementation:

// Variable statement
var Match = +2;
var Mismatch = -1;
var gap = -2;

var s0 = 'AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTT';
var s1 = 'GAAATGATCCGGAAATTGCAGCCTCAGCCCCCAGCCATCTGCTAACCCC';

var AlignmentA = "";
var AlignmentM = "";
var AlignmentB = "";

var e = '&emsp;';
var m = [];
var s = [];

var MMax = 0;
var MMin = 0;
var x = 0;
var y = 0;

// Matrix initialization and completion
s[0] = [] = s0.split('');
s[1] = [] = s1.split('');

var n_0 = s[0].length + 1;
var n_1 = s[1].length + 1;

for(var i=0; i<=n_0; i++) {
    m[i]=[];
    for(var j=0; j<=n_1; j++) {
        m[i][j]=0;
        if (i==1 && j>1) {m[i][j]=m[i][j-1]+gap;}
        if (j==1 && i>1) {m[i][j]=m[i-1][j]+gap;}
        if (i>1) {m[i][0]=s[0][i-2];}
        if (j>1) {m[0][j]=s[1][j-2];}
        if(i>1 && j>1){
            var A = m[i-1][j-1] + f(m[i][0],m[0][j]); //'\\
            var B = m[i-1][j] + gap; //'-
            var C = m[i][j-1] + gap; //'|
            var D = 0;
            m[i][j] = Math.max(A, B, C, D);
            if(m[i][j] > MMax){MMax = m[i][j];x=i;y=j;}
            if(m[i][j] < MMin){MMin = m[i][j];}
        }
    }
}

//Traceback & text alignment
var i = x;
var j = y;

while (i>=2 || j>=2) {
    var Ai = m[i][0];
    var Bj = m[0][j];
    A = m[i-1][j-1] + f(Ai, Bj);
    B = m[i-1][j] + gap;
    C = m[i][j-1] + gap;
    if(i>=2 && j>=2 && m[i][j]==A) {
        AlignmentA = Ai + AlignmentA;
        AlignmentB = Bj + AlignmentB;
        if(Ai==Bj){
            AlignmentM = '|' + AlignmentM;
        } else {
            AlignmentM = e + AlignmentM;
        }
        i = i - 1;
        j = j - 1;
    } else {
        if(i>=2 && m[i][j]==B) {
            AlignmentA = Ai + AlignmentA;
            AlignmentB = '-' + AlignmentB;
            AlignmentM = e + AlignmentM;
            i = i - 1;
        } else {
            AlignmentA = '-' + AlignmentA;
            AlignmentB = Bj + AlignmentB;
            AlignmentM = e + AlignmentM;
            j = j - 1;
        }
    }
    var r1 = i - 1;
    var r2 = j - 1;
    if(m[i][j]<=0){break;}
}

// LAYOUT
var tM='';
var tS='';

// Check the end
AlignmentA = AlignmentA + s0.substr(x-1, n_0 - x);
AlignmentB = AlignmentB + s1.substr(y-1, n_1 - y);

// Check the beginning
AlignmentA = s0.substr(0, r1) + AlignmentA;
AlignmentB = s1.substr(0, r2) + AlignmentB;

if(r1>r2){
    var v = r1 - r2;
    for(var u=1; u<=v; u++) {tS = tS + e;}
    for(var u=1; u<=v+r2; u++) {tM = tM + e;}
    AlignmentB = tS + AlignmentB;
    AlignmentM = tM + AlignmentM;
} else {
    var v = r2 - r1;
    for(var u=1; u<=v; u++) {tS = tS + e;}
    for(var u=1; u<=v+r1; u++) {tM = tM + e;}
    AlignmentA = tS + AlignmentA;
    AlignmentM = tM + AlignmentM;
}

// Print the alignment
document.write(AlignmentA + '<br>');
document.write(AlignmentM + '<br>');
document.write(AlignmentB + '<br>');

// Matching function
function f(a1, a2) {
    if(a1 === a2){return Match;} else {return Mismatch;}
}

body {
    padding: 1rem;
    font-family: monospace;
    font-size: 18px;
    font-style: normal;
    font-variant: normal;
    line-height: 20px;
}
