简体   繁体   中英

Fast way to find substring in text using suffix array and lcp

I'm trying to find words which contains substring (as input) in huge text. The text looks like this: *america*python*erica*escape*.. Example: Input: "rica" => Output: america,erica

I use suffix array.

My pseudocode (pythonlike) is:

firstChar=input[0] // the first character of input
suffixArray=getSuffixArray(text) // suffix array
result=[]

for every index of suffix array which points to firstChar:
    length=len(input)
    indexText=text[suffixArray[index]]
    indexes=[]

    if input in text[indexText: indexText+length]:
        word=find whole word containig this index between '*' 
        result.append(word)

This works, but it is too slow. LCP array should improve a runtime of algorhitm but I can't figure out how. Will you give me an advice?

Thanks in advance!

A free Python code for suffix array is at Effcient way to find longest duplicate string . It works up to 100 million characters on a personal computer.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM