C++ substr, optimizing speed

Question

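A few days ago I started learning C++. I am writing a simple xHTML parser that does not handle nested tags. For testing I have been using the following data: http://pastebin.com/bbhJHBdQ (about 10k characters). I only need to parse the data between p, h2 and h3 tags. My goal is to parse the tags and their contents into the following structure: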

struct Node {
    short tag; // p = 1, h2 = 2, h3 = 3
    std::string data;
};

For example, <p> asdasd </p> would be parsed into tag = 1, string = "asdasd". I do not want to use third-party libraries, and I am trying to optimize for speed.

Here is my code:

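// Headers this snippet relies on (tchar.h supplies _tmain and _TCHAR on MSVC)
#include <cstdio>
#include <ctime>
#include <string>
#include <vector>
#include <tchar.h>

short tagDetect(char * ptr){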
    if (*ptr == '/') {
        return 0;
    }

    if (*ptr == 'p') {
        return 1;
    }

    if (*(ptr + 1) == '2')
        return 2;

    if (*(ptr + 1) == '3')
        return 3;

    return -1;
}


struct Node {
    short tag;
    std::string data;

    Node(std::string input, short tagId) {
        tag = tagId;
        data = input;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    std::string input = GetData(); // returns the pastebin content above
    std::vector<Node> elems;

    std::string::size_type pos = 0;
    char pattern = '<';

    int openPos = 0;                 // initialized so a leading closing tag reads defined values
    short tagID = 0, lastTag = 0;

    double  duration;
    clock_t start = clock();

    for (int i = 0; i < 20000; i++) {
        elems.clear();

        pos = 0;
        while ((pos = input.find(pattern, pos)) != std::string::npos) {
            pos++;
            tagID = tagDetect(&input[pos]);
            switch (tagID) {
            case 0:
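                // Compare the closing tag with the last opening tag; the original
                // "tagID = ... == lastTag" assigned the comparison result to tagID.
                if (tagDetect(&input[pos + 1]) == lastTag && pos - openPos > 10) {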
                    elems.push_back(Node(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1), lastTag));
                }

                break;
            case 1:
            case 2:
            case 3:
                openPos = pos;
                lastTag = tagID;
                break;
            }
        }

    }

    duration = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%2.1f seconds\n", duration);
}

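My code runs in a loop so that I can measure its performance. My data contains 10k characters.

I noticed that the biggest bottleneck in the code is substr. As presented above, the code finishes in 5.8 sec. If I reduce the substr length to 10, the execution time drops to 0.4 sec. If I replace the whole substr call with "", the code finishes in 0.1 sec.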

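My questions are:

Since this is the main bottleneck of my code, how can I optimize substr?
Are there any other optimizations I could make to my code?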

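I am not sure whether this question is a good fit for SO, but I am new to C++ and I do not know who else to ask about the quality of my code.

The full source code can be found here: http://pastebin.com/dhR5afuE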

Answer 1

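Instead of storing substrings, you can store data that refers to sections of the original string (via pointers, iterators, or integer indices). You just need to make sure the original string stays intact for as long as the referring data is in use. boost::string_ref is a useful model for this even if you do not want to use it directly.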
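As a rough illustration of the index-based variant, the sketch below records an offset and a length per element instead of copying the text. The names NodeRef, materialize and collectParagraphs are hypothetical and do not appear in the question; the sketch only handles <p> elements to stay short, and it assumes the source string outlives every NodeRef that points into it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical replacement for Node: records where the text lives
// in the source string instead of owning a copy of it.
struct NodeRef {
    short tag;           // p = 1, h2 = 2, h3 = 3 (same codes as the question)
    std::size_t offset;  // index of the first content character in the source
    std::size_t length;  // number of content characters
};

// Copies the referenced text out only when a caller really needs a std::string.
std::string materialize(const std::string& source, const NodeRef& ref) {
    assert(ref.offset + ref.length <= source.size());
    return source.substr(ref.offset, ref.length);
}

// Collects every <p>...</p> element of `source` without copying its content.
std::vector<NodeRef> collectParagraphs(const std::string& source) {
    std::vector<NodeRef> refs;
    std::size_t pos = 0;
    while ((pos = source.find("<p>", pos)) != std::string::npos) {
        const std::size_t begin = pos + 3;                   // skip "<p>"
        const std::size_t end = source.find("</p>", begin);
        if (end == std::string::npos) break;                 // unterminated tag: stop
        refs.push_back(NodeRef{1, begin, end - begin});
        pos = end + 4;                                       // skip "</p>"
    }
    return refs;
}
```

With this layout the parsing loop only pushes three small integers per element, and the copy that substr performs is deferred to materialize, which is called only for the elements that are actually consumed.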

Answer 2

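There are better substring search algorithms than linear search, which is O(MxN). Look up the Boyer-Moore and Knuth-Morris-Pratt algorithms. I tested them some years ago and Boyer-Moore won.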
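To give an idea of what such an algorithm looks like, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only the bad-character shift table. The function name horspoolFind is hypothetical, and this is not the code the answer author benchmarked. Note that the skipping only pays off for multi-character patterns such as "</h2>"; for the single '<' character searched in the question it degenerates to a plain scan.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first occurrence of `pattern`
// in `text` at or after `from`, or std::string::npos if there is none.
std::size_t horspoolFind(const std::string& text, const std::string& pattern,
                         std::size_t from = 0) {
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    if (m == 0) return from <= n ? from : std::string::npos;
    if (m > n || from > n - m) return std::string::npos;

    // Bad-character table: how far the window may slide when its last
    // character is c. Characters absent from the pattern allow a full shift.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;
    }

    std::size_t pos = from;
    while (pos <= n - m) {
        // Compare the window against the pattern from right to left.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pattern[j - 1]) --j;
        if (j == 0) return pos;  // every character matched
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}
```

The function mirrors the shape of std::string::find, so it can be swapped into a search loop and timed against the standard library on the real input before committing to it.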

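You could also consider using a regular expression, which is more expensive to set up but more efficient than linear search during the actual search.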
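A minimal sketch of the regular-expression route, assuming a C++11 standard library with a working std::regex. The function name extractWithRegex is hypothetical, and the pattern only covers the non-nested, attribute-free p, h2 and h3 tags described in the question. Whether this is actually faster than the hand-written loop depends on the standard library implementation, so it should be measured rather than assumed.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns (tag name, content) pairs for non-nested p, h2 and h3 elements.
std::vector<std::pair<std::string, std::string>>
extractWithRegex(const std::string& input) {
    // Constructed once: building the regex is the costly setup step.
    // \1 requires the closing tag to repeat the captured opening tag name.
    static const std::regex pattern("<(p|h2|h3)>([^<]*)</\\1>");

    std::vector<std::pair<std::string, std::string>> out;
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it) {
        out.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return out;
}
```

This version still copies each piece of content through str(), so it addresses the search cost only, not the substr cost raised in the question.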

Answer 1
3 2014-04-17 00:19:44

Answer 2
2 2014-04-17 00:07:15
