
O(n) substring algorithm

I've been researching substring search algorithms and found that most of them, like KMP and Rabin-Karp, spend extra time on preprocessing before doing any string matching. Is there any benefit in doing so? Why don't they simply start matching right away, so that the time complexity does not become O(m+n)? I tried writing a substring algorithm that I believe is O(n) (please correct me if I'm wrong) by simply skipping the preprocessing step, and I'm wondering why people don't do it this way instead. Please refer to the C code below.

int search(char hay[], char needle[], int hayLen, int needleLen){
    int found;
    int i = 0;

    while (i < (hayLen - needleLen + 1)){
        if (hay[i] == needle[0]){
            found = 1;
            for (int j=0; j<needleLen; j++){
                if (hay[i] != needle[j]){
                    found = 0;
                    break;
                }
                i++;
            }
            if (found)
                return i - needleLen;
        }
        else
            i++;
    }
    return -1;
}

edit:

removed the strlen function to avoid any unwanted time complexities

Honestly not a terrible question. I think most of us have tried a solution like this when writing a string-finding algorithm, before discovering KMP. The answer is that this greedy algorithm doesn't work: it never goes backwards in i. You may think “aha! this is the start of the needle!” and progress forwards until discovering “uh-oh! this isn't the whole needle!”. In this algorithm, we then progress only forwards, continuing to search for the start of the needle. However, the start of the actual needle may have been what you thought was a middle character while trying to greedily match as much of the needle as possible.

For example, take the needle aab and the haystack aaab. It's not until the third a that you realize “uh-oh, this isn't the needle after all”. A thorough O(nm) algorithm then starts again from the second position, but your algorithm just marches forward and never finds the aab that starts at the second position. KMP solves this by noting, during preprocessing, which parts in the middle of the needle could also be potential starting points for the needle.
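To make that concrete, here is a rough sketch of how KMP uses its preprocessing step. The names kmp_search and build_failure are just mine for illustration, and this is one common way to write it rather than the only one. The failure table records, for each prefix of the needle, how much of an already matched run can be kept after a mismatch, so the haystack index never has to move backwards.

```c
#include <stdlib.h>

/* Hypothetical helper: fail[j] is the length of the longest proper prefix
   of needle[0..j] that is also a suffix of needle[0..j]. Runs in O(m). */
static void build_failure(const char needle[], int needleLen, int fail[]) {
    int k = 0;
    fail[0] = 0;
    for (int j = 1; j < needleLen; j++) {
        while (k > 0 && needle[j] != needle[k])
            k = fail[k - 1];
        if (needle[j] == needle[k])
            k++;
        fail[j] = k;
    }
}

/* Returns the index of the first occurrence of needle in hay,
   or -1 if it is absent (or if the table cannot be allocated). */
int kmp_search(const char hay[], const char needle[], int hayLen, int needleLen) {
    if (needleLen == 0)
        return 0;
    if (needleLen > hayLen)
        return -1;

    int *fail = malloc((size_t)needleLen * sizeof *fail);
    if (fail == NULL)
        return -1;
    build_failure(needle, needleLen, fail);

    int result = -1;
    int k = 0; /* number of needle characters currently matched */
    for (int i = 0; i < hayLen; i++) {
        /* On a mismatch, fall back to the longest prefix that still fits
           instead of discarding everything matched so far. */
        while (k > 0 && hay[i] != needle[k])
            k = fail[k - 1];
        if (hay[i] == needle[k])
            k++;
        if (k == needleLen) {
            result = i - needleLen + 1;
            break;
        }
    }
    free(fail);
    return result;
}
```

With the haystack aaab and the needle aab, kmp_search returns 1. The table costs O(m) to build and the scan is O(n), which is where the O(m+n) comes from.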

Well, your current code is O(n) but ...

Your code doesn't work!

Try this:

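#include <stdio.h>
#include <string.h>

/* Assumes the search() function from the question is defined above. */
int main()
{
    char a[] = "aaaab";
    char b[] = "aaab";
    if (search(a, b, strlen(a), strlen(b)) != -1) 
        printf("OK\n"); 
    else 
        printf("FAIL\n");
    return 0;
}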

Obviously b can be found in a but your code says it isn't present.

The problem is that you always increment i, even after a partial match has failed. That does give you O(n), but it also skips valid starting positions, so the code fails.
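For comparison, here is a minimal sketch of a version that does not have this bug. The name search_naive is my own. It keeps a separate offset j for the comparison, so i only ever advances by one starting position at a time. The price is a worst case of O(nm) instead of O(n).

```c
/* Returns the index of the first occurrence of needle in hay, or -1.
   Every starting position i is tried, so no candidate is skipped. */
int search_naive(const char hay[], const char needle[], int hayLen, int needleLen) {
    for (int i = 0; i + needleLen <= hayLen; i++) {
        int j = 0;
        /* Compare using i + j, leaving i itself untouched. */
        while (j < needleLen && hay[i + j] == needle[j])
            j++;
        if (j == needleLen)
            return i;
    }
    return -1;
}
```

Called with "aaaab" and "aaab" (lengths 5 and 4), this returns 1 rather than -1.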

removed the strlen function to avoid any unwanted time complexities

You removed the strlen call(s), but now the length of the strings has to be passed into the function:

int search(char hay[], char needle[], int hayLen, int needleLen)

So... how does the complexity of the whole substring search change as the size of needle increases? After all, whether you calculate the length inside the function or outside the function, it still needs to be done. O(m+n) means that the complexity depends on the lengths of both needle and haystack.

To take the point to an extreme, you could write an O(1) search function by just adding a parameter that indicates the location of needle in haystack.
