Rough string alignment in Python

If I have two strings of equal length like the following:

'aaaaabbbbbccccc'
'bbbebcccccddddd'

Is there an efficient way to align the two so that as many letters as possible line up, as shown below?

'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

The only way I can think of doing this is brute force by editing the strings and then iterating through and comparing.

I'm not sure what you mean by efficient, but you can use the find method on str:

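first = 'aaaaabbbbbccccc'
second = 'bbbebcccccddddd'
# Shift second to the position where its first character first occurs in first
second_prime = '-' * first.find(second[0]) + second
# Pad first with dashes so both lines have the same length
first_prime = first + '-' * (len(second_prime) - len(first))
print(first_prime + '\n' + second_prime)
# Output:
# aaaaabbbbbccccc-----
# -----bbbebcccccddddd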

Return the offset that gives the maximum score, where the score of an offset is the number of matching characters in the overlapping part of the two strings.

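def best_overlap(a, b):
    # Score every offset of b against a; max() keeps the first (score, offset) pair with the top score
    return max([(score(a[offset:], b), offset) for offset in range(len(a))], key=lambda x: x[0])[1]

def score(a, b):
    # Number of positions where a and b hold the same character (a must not be longer than b)
    return sum([a[i] == b[i] for i in range(len(a))])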

>>> a, b = 'aaaaabbbbbccccc', 'bbbebcccccddddd'
>>> best_overlap(a, b)
5
>>> a + '-' * best_overlap(a, b); '-' * best_overlap(a, b) + b
'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

Or, equivalently:

def best_match(a, b):
    best_offset = 0  # renamed from max so the builtin max() is not shadowed
    max_score = 0
    for offset in range(len(a)):
        val = score(a[offset:], b)
        if val > max_score:
            max_score = val
            best_offset = offset
    return best_offset

There is room for optimizations such as:

  1. Early exit for no matching characters

  2. Early exit when maximum possible match found
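
As a rough sketch of how those two early exits might look (the name best_overlap_early_exit is made up for this illustration, and score is rewritten with zip so the block stands on its own):

```python
def score(a, b):
    # Count positions where the two strings agree; zip stops at the shorter one.
    return sum(x == y for x, y in zip(a, b))


def best_overlap_early_exit(a, b):
    # Early exit 1: with no shared characters, no offset can score above 0.
    if not set(a) & set(b):
        return 0
    best_score, best_offset = 0, 0
    for offset in range(len(a)):
        overlap = min(len(a) - offset, len(b))
        # Early exit 2: the overlap only shrinks as the offset grows, so once
        # it cannot beat the best score, no later offset can either.
        if overlap <= best_score:
            break
        val = score(a[offset:], b)
        if val > best_score:
            best_score, best_offset = val, offset
    return best_offset
```

On the strings from the question this stops after offset 5, because the remaining overlap (9 characters at offset 6) can no longer beat the 9 matches already found. The worst case is still quadratic.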

I can't see any other way than brute forcing it. The complexity will be quadratic in the string length, which might be acceptable, depending on what string lengths you are working with.

Something like this maybe:

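def align(a, b):
    best, best_x = 0, 0
    for x in range(len(a)):
        # zip stops at the shorter argument, so b needs no slicing
        # (b[:-x] is empty when x == 0, which would skip the unshifted case)
        s = sum(i == j for (i, j) in zip(a[x:], b))
        if s > best:
            best, best_x = s, x
    return best_x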

align('aaaaabbbbbccccc', 'bbbebcccccddddd')
5

I would do something analogous to a bitwise AND over the two strings: line them up and count the positions where the letters match. Then shift one string by one position and count again, and keep shifting until the strings no longer overlap. The shift with the most matching letters is the one to output, and you can add the dashes when you print the result. You don't actually have to modify the strings for this; just keep track of the shift and offset the character comparisons by that amount. This is not terribly efficient (O(n^2), since the overlaps to compare have lengths n + (n-1) + (n-2) + ... + 1), but it is the best I could come up with.
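
A minimal sketch of that idea, comparing by index so neither string is modified until printing. The names best_shift and padded are invented for this example, and it only tries shifting the second string to the right, as in the question:

```python
def best_shift(a, b):
    """Return the rightward shift of b, relative to a, with the most matches."""
    best_count, best = 0, 0
    for shift in range(len(a)):
        # Compare a[shift + i] with b[i] by index; no slicing or padding here.
        overlap = min(len(a) - shift, len(b))
        count = sum(1 for i in range(overlap) if a[shift + i] == b[i])
        if count > best_count:
            best_count, best = count, shift
    return best


def padded(a, b, shift):
    # Dashes are added only when producing the output, so that both lines
    # end up with the same length.
    width = max(len(a), shift + len(b))
    return a.ljust(width, '-'), ('-' * shift + b).ljust(width, '-')


first, second = 'aaaaabbbbbccccc', 'bbbebcccccddddd'
shift = best_shift(first, second)
for line in padded(first, second, shift):
    print(line)
```

For the two strings in the question the best shift is 5, and the loop prints aaaaabbbbbccccc----- followed by -----bbbebcccccddddd.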
