Longest common substring via suffix array: use of sentinels

I am reading about the (apparently) well-known problem of finding the longest common substring of a set of strings, and have been following these two videos, which explain how to solve it using suffix arrays (note that this question doesn't require you to watch them):

https://youtu.be/Ic80xQFWevc

https://youtu.be/DTLjHSToxmo

The first step is that we start by concatenating all the source strings into one big one, separating each with a 'unique' sentinel, where the ASCII code of each sentinel is less than that of any character that may occur in any string. So we could have the individual strings

abca
bcad
daca

and concatenate them to give

abca#bcad$daca%

Now, there are only a limited number of possible sentinels, which leads to problems if we have a large number of strings. Indeed, someone has pointed this out on the first linked video, the response to which was

Correct, the solution is to map your alphabet to the natural numbers and shift up by the number of sentinels you need. This allows you to always have sentinels between the values, say, [1,N] and your alphabet above that. This trick makes the suffix array scalable, but you need to undo the shift to decode the true value stored in the suffix array.

I don't understand what the answer means.

I know I could post my question on the video, but I am not guaranteed a (timely) response and the audience here is far wider, so I am asking here: could someone please explain what this answer means and how to implement it?

I am not sure how to explain it better or differently than the quoted comment does, so maybe an example will help. Note that I am not using the true ASCII codes here, as I do not want to show an example with ~100 source strings. Instead, we will just assume A=1, B=2, C=3, etc.

Thus, your source strings abca, bcad, daca would translate to [1,2,3,1], [2,3,1,4], [4,1,3,1]. In order to fit in the three sentinels, however, you have to shift all those values up by 3, i.e. 1 to 3 are now sentinels and A=4, B=5, etc. The joined "string" (actually, it is a list of integers now) is [4,5,6,4, 1, 5,6,4,7, 2, 7,4,6,4, 3]. You can then translate those back to characters (defda...), run the algorithm, and then translate the result back, undoing the shift.
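The shift described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names `encode_with_shift` and `decode_shifted` are made up for this example, characters are ranked by their order in the combined alphabet (so a=1, b=2, ... for these inputs), and the suffix array construction itself is left out.

```python
def encode_with_shift(strings):
    """Join the strings into one list of integers.

    Sentinels take the values 1..k (k = number of strings); every
    character code is shifted up by k so it sorts above all sentinels.
    """
    k = len(strings)
    alphabet = sorted(set("".join(strings)))
    # Rank characters 1..m, then shift by k to make room for the sentinels.
    code = {ch: rank + k for rank, ch in enumerate(alphabet, start=1)}
    text = []
    for i, s in enumerate(strings, start=1):
        text.extend(code[ch] for ch in s)
        text.append(i)  # unique sentinel for the i-th string
    return text, code


def decode_shifted(values, code):
    """Undo the shift: map integers back to characters, skipping sentinels."""
    inverse = {v: ch for ch, v in code.items()}
    return "".join(inverse[v] for v in values if v in inverse)


text, code = encode_with_shift(["abca", "bcad", "daca"])
print(text)                            # [4, 5, 6, 4, 1, 5, 6, 4, 7, 2, 7, 4, 6, 4, 3]
print(decode_shifted(text[5:8], code))  # bca
```

Because the result is a plain list of integers, the number of sentinels is no longer limited by the number of characters below 'a' in ASCII.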

However, I would argue that instead of shifting the integers, we could just as well use negative numbers for the sentinels and then work directly on the list of integers instead of converting it back to characters (which is not possible for negative numbers): [1,2,3,1, -1, 2,3,1,4, -2, 4,1,3,1, -3]. (Note: I have not watched the video and do not know how this specific algorithm works; it could be that negative numbers are a problem, e.g. if it uses some sort of "shortest path" algorithm.)
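A rough sketch of the negative-sentinel variant, again with hypothetical names (`encode_with_negative_sentinels`, `naive_suffix_array`) and assuming lowercase a-z input. The suffix array here is built by naively sorting all suffixes, which is not what the videos do, but it is enough to show that the sentinels sort below every character.

```python
def encode_with_negative_sentinels(strings):
    """Join the strings into one list of integers (a=1, b=2, ...),
    terminating the i-th string with the unique sentinel -i."""
    text = []
    for i, s in enumerate(strings, start=1):
        text.extend(ord(ch) - ord("a") + 1 for ch in s)  # assumes a-z only
        text.append(-i)
    return text


def naive_suffix_array(text):
    """Sort suffix start positions by comparing the suffixes directly.

    Quadratic or worse; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = encode_with_negative_sentinels(["abca", "bcad", "daca"])
print(text)      # [1, 2, 3, 1, -1, 2, 3, 1, 4, -2, 4, 1, 3, 1, -3]

sa = naive_suffix_array(text)
print(sa[:3])    # [14, 9, 4]  (the three suffixes that start with a sentinel)
```

No decoding step is needed beyond mapping 1..26 back to letters, since the character codes were never shifted.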
