How do I correct the algorithm for comparing two strings containing keypresses?

Question

This algorithm is supposed to return true if two strings are equivalent. The strings may contain keypresses such as backspace. The code uses a cursor and two pointers to walk through each letter in the strings, and skips 2 positions when it finds a keypress (i.e. \b).

#!/usr/bin/env python
import argparse

# Given two different strings, one with backspaces (keypresses), find if they are equivalent or not

def main():
    parser = argparse.ArgumentParser(description="Enter two strings with or without backspaces")
    parser.add_argument("s1", type=str, help="The first string.")
    parser.add_argument("s2", type=str, help="The second string.")
    args = parser.parse_args()
    print(compare(args.s1, args.s2))

def compare(s1, s2):
    BACKSPACE = '\b'
    cursor = 0;
    pointer1 = 0; pointer2 = 0; # current position in backspaced string. 

    canon_len1 = len(s1); canon_len2 = len(s2); # length of the canonical string

    num_diff = 0
    while True:
        if s1[pointer1] == BACKSPACE or s2[pointer2] == BACKSPACE:
            # decrement the cursor and undo the previous compare
            cursor -= 1; 
            if s1[cursor] != s2[cursor]:
                num_diff -= 1
            # decrement the canonical lengths appropriately
            canon_len1 -= 2 if s1[pointer1] == BACKSPACE else 0
            canon_len2 -= 2 if s2[pointer2] == BACKSPACE else 0
        else:

            if s1[pointer1] != s2[pointer2]:
                num_diff += 1
            cursor += 1

        # increment the pointers, making sure we don't run off the end
        pointer1 += 1; pointer2 += 1;
        if pointer1 == len(s1) and pointer2 == len(s2):
            break
        if pointer1 == len(s1): pointer1 -= 1
        if pointer2 == len(s2): pointer2 -= 1

    return num_diff == 0 and canon_len1 == canon_len2

if __name__ == "__main__":
    main()

#!/usr/bin/env python

import compare_strings
import unittest

class compare_strings_test(unittest.TestCase):

    def test_01(self):
        raised = False
        try:
            compare_strings.compare('Toronto', 'Cleveland')
        except:
            raised = True
        self.assertFalse(raised, 'Exception raised')

    def test_02(self):
        equivalent = compare_strings.compare('Toronto', 'Cleveland')
        self.assertEquals(equivalent, False)

    def test_03(self):
        equivalent = compare_strings.compare('Toronto', 'Toroo\b\bnto')
        self.assertEquals(equivalent, False)

    def test_04(self):
        equivalent = compare_strings.compare('Toronto', 'Torooo\b\bntt\bo')
        self.assertEquals(equivalent, True)

if __name__ == "__main__":
    unittest.main()

...F
======================================================================
FAIL: test_04 (__main__.compare_strings_test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "compare_strings_test.py", line 26, in test_04
    self.assertEquals(equivalent, True)
AssertionError: False != True

----------------------------------------------------------------------
Ran 4 tests in 0.001s

Test 4 fails, but 'Toronto' and 'Torooo\b\bntt\bo' should be equivalent once the backspaces are applied.

Answer 1

It is simpler to apply the backspaces to each string beforehand, with a function like:

def normalize(s):
    result = []
    for c in s:
        if c == '\b':
            result.pop()  # raises IndexError on an empty list; a try/except could be added here
        else:
            result.append(c)

    return "".join(result)

and then compare the results.
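As a rough sketch of how that comparison could look against the cases from the question's test file: the `equivalent` helper name is made up here, and treating a backspace on an empty result as a no-op (instead of catching the exception) is an assumption about the desired behaviour.

```python
def normalize(s):
    """Apply every backspace in s and return the text that remains."""
    result = []
    for c in s:
        if c == '\b':
            # Assumption: a backspace with nothing before it does nothing.
            if result:
                result.pop()
        else:
            result.append(c)
    return "".join(result)


def equivalent(s1, s2):
    # Hypothetical helper: compare the strings after applying backspaces.
    return normalize(s1) == normalize(s2)


# The string pairs used by the tests in the question.
print(equivalent('Toronto', 'Cleveland'))
print(equivalent('Toronto', 'Toroo\b\bnto'))
print(equivalent('Toronto', 'Torooo\b\bntt\bo'))
```

The last pair is the one from the failing test_04; with this approach it compares as equivalent.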

Answer 2

I believe the issue in your current code stems from the fact that you can have multiple backspaces in a row, yet you only look back one character. (I may be wrong on this; I haven't stepped through the code with pdb.)

As suggested in the comments, a decent method to break down this problem would be to split it into the following two parts.

1. Canonicalize/normalize both of the input strings. This means processing them one at a time, stripping out each backspace and the character it deletes.
2. Compare the two normalized strings.

Step 2 is easy: just use the built-in string comparison operator (== in Python).

Step 1 is a little harder as you may potentially have multiple backspaces in a row in the input string. One way to handle this is to build up a new string one character at a time, and on each backspace, delete the last added character. Here is some sample code.

BACKSPACE = '\b'

def canonicalize(s):
    normalized_s = ""
    for c in s:
        # On a backspace, drop the last kept character (slicing an empty string is a no-op).
        if c == BACKSPACE:
            normalized_s = normalized_s[:-1]
        else:
            normalized_s += c

    return normalized_s

One nice side effect of this approach is that leading backspaces do not cause any errors; they are simply ignored. I'll try to keep this property in the other implementations below. In a language like C++, where strings can be modified in place, this code could be made efficient rather easily, since each edit would amount to moving a pointer and changing entries in a char array.
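To illustrate the pointer-into-a-char-array idea without leaving Python, here is a minimal sketch that uses a list as the mutable buffer and a `write` index as the "pointer". The name `canonicalize_in_place` is mine, not from the code above, and the initial copy into a list is something a C++ version would not need.

```python
BACKSPACE = '\b'


def canonicalize_in_place(s):
    # Python strings are immutable, so copy once into a mutable buffer.
    buf = list(s)
    write = 0  # index one past the last character kept so far
    for read in range(len(buf)):
        c = buf[read]
        if c == BACKSPACE:
            # Step the write position back; with nothing kept yet, ignore it.
            if write > 0:
                write -= 1
        else:
            # write never exceeds read, so this only overwrites consumed slots.
            buf[write] = c
            write += 1
    return "".join(buf[:write])


print(canonicalize_in_place("Torooo\b\bntt\bo"))
```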

In Python, each edit would create a new string (or at least, there is no guarantee that it would not allocate one). I think managing your own stack (i.e. an array of characters with a pointer to the end) could make for better code. There are a couple of ways to manage stacks in Python, the most familiar being a list, and another good option being a collections.deque. Unless a profiler says otherwise, I would go with the more familiar list.

def canonicalize(s):
    normalized_s = []
    for c in s:
        # Check for a backspace, taking care not to pop from an empty list.
        if c == BACKSPACE:
            if normalized_s:
                normalized_s.pop()
        else:
            normalized_s.append(c)

    return "".join(normalized_s)

The final compare method could look something like

def compare(s1, s2):
    return canonicalize(s1) == canonicalize(s2)

I have two problems with the above code. The first is that it is pretty much guaranteed to create two new strings. The second is that it needs four passes in total: one over each input string to clean it up, and one over each cleaned-up string to compare them.

This can be improved by going backwards instead of forwards. By iterating backwards, you see the backspaces first, and so know ahead of time which characters will be deleted (that is, ignored or skipped). We keep going until there is a mismatch or at least one string runs out of characters. This method requires a little more bookkeeping but needs no extra space. It uses just two pointers to track the current position in each string, and a counter to track the number of characters still to ignore. The code as presented below is not particularly Pythonic and could be made much nicer. You could strip all the boilerplate away by using two generators and an izip_longest (itertools.zip_longest in Python 3).

def compare(s1, s2):
    i, j = len(s1) - 1, len(s2) - 1

    while i >= 0 or j >= 0:
        ignore = 0
        while i >= 0:
            if s1[i] == BACKSPACE:
                ignore += 1
            elif ignore > 0:
                ignore -= 1
            else:
                break
            i -= 1

        ignore = 0
        while j >= 0:
            if s2[j] == BACKSPACE:
                ignore += 1
            elif ignore > 0:
                ignore -= 1
            else:
                break
            j -= 1

        if i < 0 and j < 0:
            # No more characters to try and match
            return True

        if (i < 0 and j >= 0) or (i >= 0 and j < 0):
            # One string exhausted before the other
            return False

        if s1[i] != s2[j]:
            return False

        i -= 1
        j -= 1

    return True
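The generator idea could be sketched roughly as follows, assuming Python 3 (where `izip_longest` is spelled `itertools.zip_longest`). The names `surviving_chars_reversed` and `compare_lazy` are hypothetical, chosen just for this example.

```python
from itertools import zip_longest

BACKSPACE = '\b'


def surviving_chars_reversed(s):
    """Yield the characters of s that survive the backspaces, last one first."""
    ignore = 0  # number of upcoming characters that a backspace has deleted
    for c in reversed(s):
        if c == BACKSPACE:
            ignore += 1
        elif ignore > 0:
            ignore -= 1
        else:
            yield c


def compare_lazy(s1, s2):
    # zip_longest pads the shorter stream with None, which never equals a
    # character, so a length mismatch shows up as an unequal pair.
    pairs = zip_longest(surviving_chars_reversed(s1),
                        surviving_chars_reversed(s2))
    # all() stops at the first mismatch, so the scan is still lazy.
    return all(a == b for a, b in pairs)


print(compare_lazy("Toronto", "Torooo\b\bntt\bo"))
```

Each generator carries its own `ignore` counter, which replaces the two near-identical inner loops of the pointer version.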

EDIT

Here are some test cases I tried for the last implementation of compare.

true_testcases = (
    ("abc", "abc"),
    ("abc", "abcde\b\b"),
    ("abcdef", "\b\babcdef\bf"),
    ("", "\b\b\b"),
    ("Toronto", "Torooo\b\bntt\bo"))

false_testcases = (
    ("a", "a\b"),
    ("a", "a\b\b"),
    ("abc", "abc\bd\be"),
)

print([compare(s1, s2) for s1, s2 in true_testcases])
print([compare(s1, s2) for s1, s2 in false_testcases])
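Beyond hand-picked cases, one way to gain more confidence is to cross-check the backward-scanning compare against the simpler canonicalize-then-compare approach on random inputs. The sketch below is self-contained; `skip_deleted` and `cross_check` are hypothetical helper names, and folding the two inner loops into `skip_deleted` is my restructuring, not part of the code above.

```python
import random

BACKSPACE = '\b'


def canonicalize(s):
    kept = []
    for c in s:
        if c == BACKSPACE:
            if kept:
                kept.pop()
        else:
            kept.append(c)
    return "".join(kept)


def skip_deleted(s, i):
    """Move i left to the nearest surviving character of s, or to -1."""
    ignore = 0
    while i >= 0:
        if s[i] == BACKSPACE:
            ignore += 1
        elif ignore > 0:
            ignore -= 1
        else:
            break
        i -= 1
    return i


def compare(s1, s2):
    i, j = len(s1) - 1, len(s2) - 1
    while i >= 0 or j >= 0:
        i, j = skip_deleted(s1, i), skip_deleted(s2, j)
        if i < 0 or j < 0:
            # Equivalent only if both strings ran out at the same time.
            return i < 0 and j < 0
        if s1[i] != s2[j]:
            return False
        i -= 1
        j -= 1
    return True


def cross_check(trials=2000, seed=0):
    """Return the first disagreeing pair found, or None if there is none."""
    rng = random.Random(seed)
    alphabet = "ab" + BACKSPACE  # tiny alphabet so equal results are common
    for _ in range(trials):
        s1 = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        s2 = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        if compare(s1, s2) != (canonicalize(s1) == canonicalize(s2)):
            return (s1, s2)
    return None


print(cross_check())
```

If the two implementations ever disagree, `cross_check` returns the offending pair, which makes a convenient new test case.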

solution1
2 2019-02-12 22:58:02

solution2
2 ACCEPTED 2019-02-13 00:40:10