C++ substr, optimizing speed

Question

A few days ago I started learning C++. I'm writing a simple XHTML parser that doesn't handle nested tags. For testing I have been using the following data: http://pastebin.com/bbhJHBdQ (around 10k chars). I only need to parse the data between p, h2 and h3 tags. My goal is to parse each tag and its content into the following structure:

struct Node {
    short tag; // p = 1, h2 = 2, h3 = 3
    std::string data;
};

For example, <p> asdasd </p> will be parsed to tag = 1, data = "asdasd". I don't want to use third-party libraries, and I'm trying to optimize for speed.

Here is my code:

short tagDetect(char * ptr){
    if (*ptr == '/') {
        return 0;
    }

    if (*ptr == 'p') {
        return 1;
    }

    if (*(ptr + 1) == '2')
        return 2;

    if (*(ptr + 1) == '3')
        return 3;

    return -1;
}


struct Node {
    short tag;
    std::string data;

    Node(std::string input, short tagId) {
        tag = tagId;
        data = input;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    std::string input = GetData(); // returns the pastebin content above
    std::vector<Node> elems;

    std::string::size_type pos = 0;
    char pattern = '<';

    int openPos = 0;           // initialized so a leading closing tag never reads garbage
    short tagID, lastTag = 0;  // 0 = no opening tag seen yet

    double  duration;
    clock_t start = clock();

    for (int i = 0; i < 20000; i++) {
        elems.clear();

        pos = 0;
        while ((pos = input.find(pattern, pos)) != std::string::npos) {
            pos++;
            tagID = tagDetect(&input[pos]);
            switch (tagID) {
            case 0:
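                // Parenthesized: == binds tighter than =, so without the parentheses tagID received the comparison result.
                if ((tagID = tagDetect(&input[pos + 1])) == lastTag && pos - openPos > 10) {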
                    elems.push_back(Node(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1), lastTag));
                }

                break;
            case 1:
            case 2:
            case 3:
                openPos = pos;
                lastTag = tagID;
                break;
            }
        }

    }

    duration = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%2.1f seconds\n", duration);
}

The parsing runs inside a loop so that I can measure its performance. My data contains about 10k chars.

I have noticed that the biggest bottleneck in my code is the substr call. As presented above, the code finishes executing in 5.8 sec. If I limit the substr length to 10, the execution time drops to 0.4 sec. If I replace the whole substr call with "", my code finishes in 0.1 sec.

My questions are:

How can I optimize the substr call, since it's the main bottleneck in my code?
Are there any other optimizations I can make to my code?

I'm not sure if this question is appropriate for SO, but I'm pretty new to C++ and I don't know who else to ask whether my code is complete crap.

Full source code can be found here: http://pastebin.com/dhR5afuE

Answer 1

Instead of storing substrings, you could store data which refers to sections in the original string (either via pointers, iterators or integer indexes). You just have to be careful that the original string stays intact for as long as the reference data is used. Take a hint from boost::string_ref even if you're unwilling to use it directly.
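
As a rough illustration of that idea, here is a minimal sketch that assumes a C++17 compiler and uses std::string_view (the standard-library counterpart of boost::string_ref). The names NodeView and collect are hypothetical and not part of the question's code; the function handles one tag name per call and is not a drop-in replacement for the original loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical non-owning counterpart of the question's Node:
// `data` refers to a section of the original buffer instead of copying it.
struct NodeView {
    short tag;              // p = 1, h2 = 2, h3 = 3
    std::string_view data;  // valid only while the source string is alive and unmodified
};

// Appends the content of every open...close pair to `out`,
// e.g. open = "<p>", close = "</p>". No characters are copied.
void collect(std::string_view input, std::string_view open,
             std::string_view close, short tag, std::vector<NodeView>& out) {
    assert(!open.empty() && !close.empty());  // empty delimiters would never advance
    std::size_t pos = 0;
    while ((pos = input.find(open, pos)) != std::string_view::npos) {
        const std::size_t begin = pos + open.size();
        const std::size_t end = input.find(close, begin);
        if (end == std::string_view::npos) break;  // unterminated tag: stop
        // string_view::substr only adjusts a pointer and a length.
        out.push_back({tag, input.substr(begin, end - begin)});
        pos = end + close.size();
    }
}
```

Calling collect(input, "<p>", "</p>", 1, elems) fills elems with views into input. The views dangle if input is destroyed or reallocated, which is the caveat mentioned above.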

Answer 2

There are better substring search algorithms than a plain linear search, which is O(MxN). Look up the Boyer-Moore and Knuth-Morris-Pratt algorithms. I tested these years ago and BM won.
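
For reference, C++17 ships a Boyer-Moore searcher in <functional>, so no third-party library is needed to try it. The sketch below is only an illustration: findAll is a hypothetical helper name, and a searcher can only pay off for multi-character patterns such as "</p>", not for the single '<' character the question searches for.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the start offset of every non-overlapping occurrence of `needle`
// in `haystack`, using the standard-library Boyer-Moore searcher.
std::vector<std::size_t> findAll(const std::string& haystack, const std::string& needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;
    // The pattern is preprocessed once here; the searcher is reused for every search.
    const std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    auto it = haystack.begin();
    while ((it = std::search(it, haystack.end(), searcher)) != haystack.end()) {
        hits.push_back(static_cast<std::size_t>(it - haystack.begin()));
        it += needle.size();  // continue after this match
    }
    return hits;
}
```

Whether this beats std::string::find on a 10k-character input would have to be measured; the preprocessing step has a cost of its own.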

You could also consider using a regular expression, which is more expensive to set up but could be more efficient in the actual search than your linear search.
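
A regular-expression version could look roughly like the sketch below, which uses std::regex from the standard library. The pattern and the parseWithRegex name are assumptions made for illustration, and Node is redeclared as a plain aggregate mirroring the question's struct. It relies on the question's premise that tags are not nested, and no claim is made here about its speed relative to the hand-written loop.

```cpp
#include <regex>
#include <string>
#include <vector>

// Mirrors the question's Node, declared as a plain aggregate.
struct Node {
    short tag;         // p = 1, h2 = 2, h3 = 3
    std::string data;
};

// Extracts every <p>, <h2> and <h3> element whose content contains no '<'.
std::vector<Node> parseWithRegex(const std::string& input) {
    // Group 1: tag name, group 2: content; \1 requires a matching closing tag.
    // Compiled once, since construction is the expensive part.
    static const std::regex tagRe(R"(<(p|h2|h3)>([^<]*)</\1>)");
    std::vector<Node> out;
    for (std::sregex_iterator it(input.begin(), input.end(), tagRe), end; it != end; ++it) {
        const std::string name = (*it)[1].str();
        const short tag = name == "p" ? 1 : (name == "h2" ? 2 : 3);
        out.push_back({tag, (*it)[2].str()});
    }
    return out;
}
```

Note that this version still copies each match into a std::string, so it does not remove the copying cost the question measures.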

2 answers

solution1
3 2014-04-17 00:19:44

solution2
2 2014-04-17 00:07:15
