Optimizing Python Algorithm for a particular Use Case using Pre-Computing

Question

I am trying to solve a specific variant of the problem mentioned here:

Given a string s and a string t, check if s is subsequence of t.

I wrote an algorithm that works fine for the above question:

def isSubsequence(s, t):
        """
        :type s: str
        :type t: str
        :rtype: bool
        """
        i = 0

        for x in t:
            if i<len(s) and x==s[i]:
                i = i + 1

        return i==len(s)

Now there is a particular use case:

If there are lots of incoming S, say S1, S2, ... , Sk where k >= 1 Billion, and you want to check one by one to see if T has its subsequence.

There is a hint:

/**
 * If we check each sk in this way, then it would be O(kn) time where k is the number of s and t is the length of t. 
 * This is inefficient. 
 * Since there is a lot of s, it would be reasonable to preprocess t to generate something that is easy to search for if a character of s is in t. 
 * Sounds like a HashMap, which is super suitable for search for existing stuff. 
 */

But the logic seems like inverting the logic of the algorithm above algorithm, if s is traversed and the character is searched in t using hashmap, it will not be always correct as a hashmap of t will have only 1 index for that character and there is no guarantee that the order will be preserved.

So, I am stuck at how to optimize the algorithm for the above use case?

Thanks for your help.

Answer 1

For each i less than len(t) , and each character c that occurs in t , make a mapping from (i,c)->j , where j is the first index >= i that at which c occurs.

Then you can iterate through each Sk, using the map to find the next occurrence of each required character, if it exists.

This is essentially making a deterministic finite automaton that matches subsequences of t ( https://en.wikipedia.org/wiki/Deterministic_finite_automaton ).

Answer 2

You can preprocess t to create a list of all possible subsequences (keep in mind that t will have 2^len(t)-1 subsequences). You can turn this into a hashtable and then iterate over your list of s , checking for each s in the table. the advantage is you don't have to iterate over t for each s .

By the way, if you get stuck on preprocessing t for a list of all subsequences, you should look into powerset and its implementation in python.

Optimizing Python Algorithm for a particular Use Case using Pre-Computing

Question

2 answers

solution1
2 2018-02-27 18:22:25

solution2
1 2018-02-27 18:08:15

Optimizing Python Algorithm for a particular Use Case using Pre-Computing

Question

2 answers

solution1 2 2018-02-27 18:22:25

solution2 1 2018-02-27 18:08:15

solution1
2 2018-02-27 18:22:25

solution2
1 2018-02-27 18:08:15