
C++ substr, optimizing speed

A few days ago I started learning C++. I'm writing a simple xHTML parser that doesn't handle nested tags. For testing I have been using the following data: http://pastebin.com/bbhJHBdQ (around 10k chars). I only need to parse the data between p, h2 and h3 tags. My goal is to parse each tag and its content into the following structure:

struct Node {
    short tag; // p = 1, h2 = 2, h3 = 3
    std::string data;
};

For example, <p> asdasd </p> will be parsed to tag = 1, string = "asdasd". I don't want to use third-party libraries, and I'm trying to optimize for speed.

Here is my code:

short tagDetect(char * ptr){
    if (*ptr == '/') {
        return 0;
    }

    if (*ptr == 'p') {
        return 1;
    }

    if (*(ptr + 1) == '2')
        return 2;

    if (*(ptr + 1) == '3')
        return 3;

    return -1;
}


struct Node {
    short tag;
    std::string data;

    Node(std::string input, short tagId) {
        tag = tagId;
        data = input;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    std::string input = GetData(); // returns the pastebin content above
    std::vector<Node> elems;

    std::string::size_type pos = 0;
    char pattern = '<';

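    int openPos = 0;           // initialized so the first closing tag never reads garbage
    short tagID, lastTag = -1; // -1 = no opening tag seen yet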

    double  duration;
    clock_t start = clock();

    for (int i = 0; i < 20000; i++) {
        elems.clear();

        pos = 0;
        while ((pos = input.find(pattern, pos)) != std::string::npos) {
            pos++;
            tagID = tagDetect(&input[pos]);
            switch (tagID) {
            case 0:
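                // Compare the closing tag with the last opening tag; the original "tagID = ... == ..." stored the comparison result in tagID.
                if (tagDetect(&input[pos + 1]) == lastTag && pos - openPos > 10) {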
                    elems.push_back(Node(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1), lastTag));
                }

                break;
            case 1:
            case 2:
            case 3:
                openPos = pos;
                lastTag = tagID;
                break;
            }
        }

    }

    duration = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%2.1f seconds\n", duration);
}

The parsing runs inside a loop so that I can measure its performance. My data contains about 10k chars.

I have noticed that the biggest bottleneck in my code is the substr call. As presented above, the code finishes executing in 5.8 sec. If I reduce the substr length to 10, the execution time drops to 0.4 sec. If I replace the whole substr with "", my code finishes in 0.1 sec.

My questions are:

  • How can I optimize the substr call, since it's the main bottleneck in my code?
  • Are there any other optimizations I can make to my code?

I'm not sure whether this question is suitable for SO, but I'm pretty new to C++ and I don't know who else to ask whether my code is complete crap.

Full source code can be found here: http://pastebin.com/dhR5afuE

Instead of storing substrings, you could store data that refers to sections of the original string (via pointers, iterators or integer indexes). You just have to make sure the original string stays intact for as long as the reference data is in use. Take a hint from boost::string_ref even if you're unwilling to use it directly.
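A minimal sketch of the integer-index variant, to make the idea concrete. NodeRef, openTagAt, parseRefs and textOf are hypothetical names, not part of the question's code. The sketch assumes bare tags with no attributes and leaves out the question's "longer than 10 characters" filter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical replacement for Node: it records where the text lives in the
// source string instead of owning a copy of it.
struct NodeRef {
    short tag;           // p = 1, h2 = 2, h3 = 3
    std::size_t offset;  // index of the first content character
    std::size_t length;  // number of content characters
};

// Returns the tag id if "p>", "h2>" or "h3>" starts at input[pos], else -1.
// Assumption: tags carry no attributes. Requires pos <= input.size().
short openTagAt(const std::string& input, std::size_t pos) {
    if (input.compare(pos, 2, "p>") == 0) return 1;
    if (input.compare(pos, 3, "h2>") == 0) return 2;
    if (input.compare(pos, 3, "h3>") == 0) return 3;
    return -1;
}

// Scans the input once and stores (tag, offset, length) triples.
// No character data is copied while parsing.
std::vector<NodeRef> parseRefs(const std::string& input) {
    std::vector<NodeRef> elems;
    short lastTag = -1;            // -1 = no open tag pending
    std::size_t contentStart = 0;
    std::size_t pos = 0;
    while ((pos = input.find('<', pos)) != std::string::npos) {
        const std::size_t lt = pos;  // index of '<'
        ++pos;
        if (pos < input.size() && input[pos] == '/') {
            // Closing tag: emit a reference if it matches the pending open tag.
            if (lastTag > 0 && openTagAt(input, pos + 1) == lastTag) {
                elems.push_back(NodeRef{lastTag, contentStart, lt - contentStart});
            }
            lastTag = -1;
        } else {
            const short tag = openTagAt(input, pos);
            if (tag > 0) {
                lastTag = tag;
                contentStart = pos + (tag == 1 ? 2 : 3);  // skip "p>" or "h2>"/"h3>"
            }
        }
    }
    return elems;
}

// Materializes a copy only when the text is actually needed.
// The source string must be the one that was parsed, unmodified.
std::string textOf(const std::string& source, const NodeRef& ref) {
    assert(ref.offset + ref.length <= source.size());
    return source.substr(ref.offset, ref.length);
}
```

With this layout the parse loop allocates nothing per element beyond vector growth. The cost of copying is deferred to textOf and paid only for the nodes that are actually read. The references become invalid as soon as the input string is modified or destroyed.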

There are better substring search algorithms than a plain linear search, which is O(M×N). Look up the Boyer-Moore and Knuth-Morris-Pratt algorithms. I tested these years ago and BM won.
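As a rough illustration, here is a sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only the bad-character shift table. It is not the full Boyer-Moore algorithm, and horspoolFind is a hypothetical name. Note that the skipping only pays off for multi-character patterns such as "</p>"; for a single character like '<' there is nothing to skip.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Boyer-Moore-Horspool search (simplified Boyer-Moore, bad-character rule only).
// Returns the index of the first match at or after `from`, or std::string::npos.
std::size_t horspoolFind(const std::string& haystack,
                         const std::string& needle,
                         std::size_t from = 0) {
    assert(!needle.empty());  // precondition of this sketch
    const std::size_t n = haystack.size();
    const std::size_t m = needle.size();
    if (from > n || m > n - from) return std::string::npos;

    // shift[c] = distance from the last occurrence of c in needle[0 .. m-2]
    // to the end of the needle; m if c does not occur there.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(needle[i])] = m - 1 - i;
    }

    std::size_t pos = from;
    while (pos <= n - m) {
        // Compare the window against the needle from right to left.
        std::size_t j = m;
        while (j > 0 && haystack[pos + j - 1] == needle[j - 1]) --j;
        if (j == 0) return pos;
        // Jump ahead based on the haystack character under the needle's last slot.
        pos += shift[static_cast<unsigned char>(haystack[pos + m - 1])];
    }
    return std::string::npos;
}
```

A call such as horspoolFind(input, "</p>", pos) could stand in for input.find("</p>", pos). Whether it is actually faster than the standard library's find on a given compiler is something to measure rather than assume.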

You could also consider using a regular expression, which is more expensive to set up but could be more efficient in the actual search than your linear search.
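A sketch of what that could look like with std::regex from the standard library (C++11), so no third-party code is needed. The pattern, tagIdFor and parseWithRegex are illustrative assumptions: the pattern handles only bare, non-nested p, h2 and h3 tags, and the regex object is built once and reused because construction is the expensive part. Whether this beats the hand-written loop would need to be measured.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct Node {
    short tag;         // p = 1, h2 = 2, h3 = 3
    std::string data;
};

// Maps a captured tag name to the numeric id used in the question.
short tagIdFor(const std::string& name) {
    if (name == "p") return 1;
    if (name == "h2") return 2;
    assert(name == "h3");  // the pattern below admits no other names
    return 3;
}

std::vector<Node> parseWithRegex(const std::string& input) {
    // Group 1: tag name. Group 2: content without '<'. \1 requires the
    // closing tag to repeat the opening tag's name.
    static const std::regex pattern("<(p|h2|h3)>([^<]*)</\\1>");

    std::vector<Node> elems;
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        elems.push_back(Node{tagIdFor(m[1].str()), m[2].str()});
    }
    return elems;
}
```

Each match still copies its content into Node::data, so this addresses the search, not the cost of substr itself.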
