How to correct an algorithm that compares two strings containing keypresses?

Question

This algorithm should return true if two strings are equivalent. The strings may contain keypresses such as backspace. The code uses a cursor and pointers to walk through each letter in the strings, and skips 2 positions if it finds a keypress (i.e. \b).

#!/usr/bin/env python
import argparse

# Given two different strings, one with backspaces (keypresses), find if they are equivalent or not

def main():
    parser = argparse.ArgumentParser(description="Enter two strings with or without backspaces")
    parser.add_argument("s1", type=str, help="The first string.")
    parser.add_argument("s2", type=str, help="The second string.")
    args = parser.parse_args()
    print(compare(args.s1, args.s2))

def compare(s1, s2):
    BACKSPACE = '\b'
    cursor = 0;
    pointer1 = 0; pointer2 = 0; # current position in backspaced string. 

    canon_len1 = len(s1); canon_len2 = len(s2); # length of the canonical string

    num_diff = 0
    while True:
        if s1[pointer1] == BACKSPACE or s2[pointer2] == BACKSPACE:
            # decrement the cursor and undo the previous compare
            cursor -= 1; 
            if s1[cursor] != s2[cursor]:
                num_diff -= 1
            # decrement the canonical lengths appropriately
            canon_len1 -= 2 if s1[pointer1] == BACKSPACE else 0
            canon_len2 -= 2 if s2[pointer2] == BACKSPACE else 0
        else:

            if s1[pointer1] != s2[pointer2]:
                num_diff += 1
            cursor += 1

        # increment the pointers, making sure we don't run off the end
        pointer1 += 1; pointer2 += 1;
        if pointer1 == len(s1) and pointer2 == len(s2):
            break
        if pointer1 == len(s1): pointer1 -= 1
        if pointer2 == len(s2): pointer2 -= 1

    return num_diff == 0 and canon_len1 == canon_len2

if __name__ == "__main__":
    main()

#!/usr/bin/env python

import compare_strings
import unittest

class compare_strings_test(unittest.TestCase):

    def test_01(self):
        raised = False
        try:
            compare_strings.compare('Toronto', 'Cleveland')
        except:
            raised = True
        self.assertFalse(raised, 'Exception raised')

    def test_02(self):
        equivalent = compare_strings.compare('Toronto', 'Cleveland')
        self.assertEquals(equivalent, False)

    def test_03(self):
        equivalent = compare_strings.compare('Toronto', 'Toroo\b\bnto')
        self.assertEquals(equivalent, False)

    def test_04(self):
        equivalent = compare_strings.compare('Toronto', 'Torooo\b\bntt\bo')
        self.assertEquals(equivalent, True)

if __name__ == "__main__":
    unittest.main()

...F
======================================================================
FAIL: test_04 (__main__.compare_strings_test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "compare_strings_test.py", line 26, in test_04
    self.assertEquals(equivalent, True)
AssertionError: False != True

----------------------------------------------------------------------
Ran 4 tests in 0.001s

Test 4 fails, but 'Toronto' and 'Torooo\b\bntt\bo' should be equivalent once the backspaces are applied.

Answer 1

It is better to remove the backspaces from each string beforehand, with a function like:

def normalize(s):
    result = []
    for c in s:
        if c == '\b':
            result.pop()  # A try/except block could be added here
        else:
            result.append(c)

    return "".join(result)

and then compare the results.
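A minimal sketch of that comparison step, with the try/except guard hinted at in the code comment filled in. The `compare` wrapper name and the choice to silently ignore a backspace on empty input are assumptions, not part of the original answer:

```python
def normalize(s):
    """Apply every backspace ('\\b') in s and return the text that remains."""
    result = []
    for c in s:
        if c == '\b':
            try:
                result.pop()
            except IndexError:
                # Assumption: a backspace with nothing left to delete is ignored.
                pass
        else:
            result.append(c)
    return "".join(result)


def compare(s1, s2):
    # Hypothetical wrapper: two inputs are equivalent when their
    # normalized forms are equal.
    return normalize(s1) == normalize(s2)


print(compare('Toronto', 'Torooo\b\bntt\bo'))  # True
print(compare('Toronto', 'Cleveland'))         # False
```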

Answer 2

I believe the issue in your current code stems from the fact that you can have multiple backspaces in a row, yet you only look back one character. (I may be wrong on this; I haven't stepped through the code with pdb.)

As suggested in the comments, a decent way to break down this problem is to split it into the following two parts.

1. Canonicalize/normalize both of the input strings. This means processing them one at a time, stripping out each backspace and the previous character it deletes.
2. Compare the two normalized strings.

Step 2 is easy: just use the built-in string comparison (== in Python).

Step 1 is a little harder, as you may have multiple backspaces in a row in the input string. One way to handle this is to build up a new string one character at a time and, on each backspace, delete the last added character. Here is some sample code.

BACKSPACE = '\b'

def canonicalize(s):
    normalized_s = ""
    for c in s:
        # A backspace removes the most recently kept character.
        # Slicing an empty string is a no-op, so leading backspaces are safe.
        if c == BACKSPACE:
            normalized_s = normalized_s[:-1]
        else:
            normalized_s += c

    return normalized_s

One nice side effect of this approach is that leading backspaces do not cause any errors; they are ignored. I'll try to keep this property in the other implementations below. In a language like C++, where strings can be modified in place, this code could be made efficient rather easily, since it amounts to moving a pointer and overwriting entries in a char array.

In Python, each edit would create a new string (or at least, there is no guarantee that it would not allocate a new string). I think managing your own stack (that is, an array of characters with a pointer to the end) could make for better code. There are a couple of ways to manage stacks in Python, the most familiar being a list, and another good option being collections.deque. Unless a profiler says otherwise, I would go with the more familiar list.

def canonicalize(s):
    normalized_s = list()
    for c in s:
        # On a backspace, drop the last kept character if there is one.
        if c == BACKSPACE:
            if normalized_s:
                normalized_s.pop()
        else:
            normalized_s.append(c)

    return "".join(normalized_s)
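For comparison, here is a rough sketch of the collections.deque alternative mentioned above. It behaves the same as the list version; only the container changes. The `BACKSPACE` constant is defined here so the snippet runs on its own, and whether a deque is actually faster for this workload is not something the answer claims, so treat this as an illustration rather than a recommendation:

```python
from collections import deque

BACKSPACE = '\b'


def canonicalize(s):
    stack = deque()  # used purely as a stack: append and pop on the right end
    for c in s:
        if c == BACKSPACE:
            if stack:        # leading backspaces are ignored
                stack.pop()
        else:
            stack.append(c)
    # Iterating a deque goes left to right, so the original order is kept.
    return "".join(stack)


print(canonicalize("abcde\b\b"))  # abc
```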

The final compare method could look something like this:

def compare(s1, s2):
    return canonicalize(s1) == canonicalize(s2)

I have two problems with the above code. The first is that it is pretty much guaranteed to create two new strings. The second is that it needs four passes in total over the strings: one for each of the input strings, and one for each of the cleaned-up strings.

This can be improved by going backwards instead of forwards. By iterating backwards, you see the backspaces first, and know ahead of time which characters will be deleted (that is, ignored or skipped). We keep going until there is a mismatch or at least one string runs out of characters. This method requires a little more bookkeeping but needs no extra space. It uses just two pointers to track the current position in each string, and a counter to track the number of characters to ignore. The code as presented below is not particularly Pythonic and could be made much nicer. You could strip all the boilerplate away by using two generators and an izip_longest (zip_longest in Python 3).

def compare(s1, s2):
    i, j = len(s1) - 1, len(s2) - 1

    while i >= 0 or j >= 0:
        ignore = 0
        while i >= 0:
            if s1[i] == BACKSPACE:
                ignore += 1
            elif ignore > 0:
                ignore -= 1
            else:
                break
            i -= 1

        ignore = 0
        while j >= 0:
            if s2[j] == BACKSPACE:
                ignore += 1
            elif ignore > 0:
                ignore -= 1
            else:
                break
            j -= 1

        if i < 0 and j < 0:
            # No more characters to try and match
            return True

        if (i < 0 and j >= 0) or (i >= 0 and j < 0):
            # One string exhausted before the other
            return False

        if s1[i] != s2[j]:
            return False

        i -= 1
        j -= 1

    return True
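The generator-based variant alluded to earlier might look roughly like the sketch below. It assumes Python 3, where `izip_longest` is spelled `itertools.zip_longest`. The helper name `surviving_chars_reversed` is made up for illustration and does not come from the answer:

```python
from itertools import zip_longest  # izip_longest in Python 2

BACKSPACE = '\b'


def surviving_chars_reversed(s):
    """Yield the characters of s that survive backspacing, last one first."""
    ignore = 0  # characters still to be skipped because of backspaces seen so far
    for c in reversed(s):
        if c == BACKSPACE:
            ignore += 1
        elif ignore > 0:
            ignore -= 1
        else:
            yield c


def compare(s1, s2):
    # zip_longest pads the shorter stream with None, which never equals a
    # character, so strings of different canonical length compare unequal.
    pairs = zip_longest(surviving_chars_reversed(s1),
                        surviving_chars_reversed(s2))
    return all(a == b for a, b in pairs)


print(compare("Toronto", "Torooo\b\bntt\bo"))  # True
print(compare("a", "a\b"))                     # False
```

Because `all` stops at the first mismatch and the generators are lazy, this keeps the early-exit behaviour of the two-pointer version and does not build any intermediate strings.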

EDIT

Here are some test cases I tried for the last implementation of compare.

true_testcases = (
    ("abc", "abc"),
    ("abc", "abcde\b\b"),
    ("abcdef", "\b\babcdef\bf"),
    ("", "\b\b\b"),
    ("Toronto", "Torooo\b\bntt\bo"))

false_testcases = (
    ("a", "a\b"),
    ("a", "a\b\b"),
    ("abc", "abc\bd\be"),
)

print([compare(s1, s2) for s1, s2 in true_testcases])
print([compare(s1, s2) for s1, s2 in false_testcases])

Answer 1
2 2019-02-12 22:58:02

Answer 2
2 accepted 2019-02-13 00:40:10
