
Rough string alignment in Python

If I have two strings of equal length like the following:

'aaaaabbbbbccccc'
'bbbebcccccddddd'

Is there an efficient way to align the two so that as many letters as possible line up, as shown below?

'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

The only way I can think of doing this is by brute force: editing the strings, then iterating through them and comparing.

I'm not sure what you mean by efficient, but you can use the find method on str:

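first = 'aaaaabbbbbccccc'
second = 'bbbebcccccddddd'
# Pad second up to the first occurrence of its first character in first.
second_prime = '-' * first.find(second[0]) + second
# Pad first on the right so both strings have the same length.
first_prime = first + '-' * (len(second_prime) - len(first))
print(first_prime + '\n' + second_prime)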
# Output:
# aaaaabbbbbccccc-----
# -----bbbebcccccddddd
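
As a rough sketch, the same idea can be wrapped in a function that also handles the case where `str.find` returns -1 (character not found). The name `align_by_first_char` and the no-overlap fallback are assumptions made for illustration. This heuristic looks only at the first character of `second`, so it will not always find the offset with the most matches.

```python
def align_by_first_char(first, second):
    """Pad both strings so second starts where its first character first occurs in first."""
    offset = first.find(second[0]) if second else 0
    # str.find returns -1 when the character is absent. As an assumed
    # fallback, place second entirely after first (no overlap).
    if offset == -1:
        offset = len(first)
    second_padded = '-' * offset + second
    first_padded = first + '-' * max(0, len(second_padded) - len(first))
    return first_padded, second_padded


for line in align_by_first_char('aaaaabbbbbccccc', 'bbbebcccccddddd'):
    print(line)
```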

Return the offset that gives the maximum score, where the score of an offset is the number of characters that match when the strings are lined up at that offset.

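def best_overlap(a, b):
    # Pair each offset with its score and return the offset with the highest score.
    return max(((score(a[offset:], b), offset) for offset in range(len(a))), key=lambda x: x[0])[1]

def score(a, b):
    # Count matching positions; assumes a (the sliced string) is no longer than b.
    return sum(a[i] == b[i] for i in range(len(a)))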

>>> a, b = 'aaaaabbbbbccccc', 'bbbebcccccddddd'
>>> best_overlap(a, b)
5
>>> a + '-' * best_overlap(a, b); '-' * best_overlap(a, b) + b
'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

Or, equivalently:

def best_match(a, b):
    # Track the best offset explicitly (avoids shadowing the built-in max).
    best_offset = 0
    max_score = 0
    for offset in range(len(a)):
        val = score(a[offset:], b)
        if val > max_score:
            max_score = val
            best_offset = offset
    return best_offset

There is room for optimizations such as:

  1. Early exit when the strings have no matching characters

  2. Early exit when the maximum possible match has been found
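
A minimal sketch of how these two early exits might look, assuming the same scoring as above. The name `best_offset_early_exit` is made up for illustration, and `score` is redefined here with `zip` so that the block is self-contained.

```python
def score(a, b):
    """Count positions where a and b agree (zip stops at the shorter string)."""
    return sum(x == y for x, y in zip(a, b))


def best_offset_early_exit(a, b):
    """Offset of b relative to a with the most matching characters."""
    # Optimization 1: with no shared characters, no offset can score above 0.
    if not set(a) & set(b):
        return 0
    best_offset, best_score = 0, 0
    for offset in range(len(a)):
        overlap = min(len(a) - offset, len(b))
        # Optimization 2: the overlap only shrinks as the offset grows, so once
        # the best score reaches the overlap length, no later offset can beat it.
        if best_score >= overlap:
            break
        current = score(a[offset:], b)
        if current > best_score:
            best_offset, best_score = offset, current
    return best_offset


print(best_offset_early_exit('aaaaabbbbbccccc', 'bbbebcccccddddd'))  # prints 5
```

The worst case is still quadratic; the early exits only help when a strong match appears at a small offset or when the strings share no characters at all.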

I can't see any other way than brute forcing it. The complexity will be quadratic in the string length, which might be acceptable, depending on what string lengths you are working with.

Something like this maybe:

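def align(a, b):
    best, best_x = 0, 0
    for x in range(len(a)):
        # zip stops at the shorter string, so b needs no slicing
        # (b[:-x] would be empty for x == 0 and skip that offset).
        s = sum(i == j for (i, j) in zip(a[x:], b))
        if s > best:
            best, best_x = s, x
    return best_x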

align('aaaaabbbbbccccc', 'bbbebcccccddddd')
5

I would do something like a bitwise & on the two strings: compare them character by character while they are lined up and count the number of positions where the letters match. Then shift by one, do the same thing, and keep shifting until the strings no longer overlap. The shift with the most matching letters is the one to output, and you can add the dashes when you print it out. You don't actually have to modify the strings for this; just keep track of the shift and offset your character comparisons by that amount. This is not terribly efficient (n + (n-1) + (n-2) + ... comparisons, which is O(n^2)), but it is the best I could come up with.
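
The shifting procedure described above could be sketched as follows. The function names `best_shift` and `show` are hypothetical, and only shifts of the second string to the right are considered.

```python
def best_shift(a, b):
    """Return the right-shift of b (relative to a) with the most matching letters."""
    best, best_count = 0, 0
    for shift in range(len(a)):
        # Compare a[shift + i] with b[i] by index; neither string is modified.
        overlap = min(len(a) - shift, len(b))
        count = sum(1 for i in range(overlap) if a[shift + i] == b[i])
        if count > best_count:
            best, best_count = shift, count
    return best


def show(a, b, shift):
    """Add the dashes only when producing the output strings."""
    width = max(len(a), shift + len(b))
    return a.ljust(width, '-'), ('-' * shift + b).ljust(width, '-')


a, b = 'aaaaabbbbbccccc', 'bbbebcccccddddd'
for line in show(a, b, best_shift(a, b)):
    print(line)
```

Ties are resolved in favour of the smallest shift, because a later shift replaces the current best only when its count is strictly higher.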

