How can I quickly find multiple substrings in a string and extract them in C++?

Question

I'm new to C++ and struggling with the following problem:
I'm parsing syslog messages from iptables. Each message looks like:
192.168.1.1:20200:Dec 11 15:20:36 SRC=192.168.1.5 DST=8.8.8.8 LEN=250
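I need to parse the string quickly (because new messages arrive very fast) to get SRC, DST and LEN.
If this were a simple program, I would use std::string::find to locate the index of the SRC substring and then, in a loop, append each following character to an array until a space is encountered. Then I would do the same for DST and LEN.
For example,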

#include <iostream>
#include <string>

int main() {
    std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
    std::string substr;

    std::cout << "Original string: \"" << x << "\"" << std::endl;

    // The "magic number" 4 below is the length of the "SRC=" string,
    // which is the same for "DST=" and "LEN=".

    // For SRC: the value runs from just after "SRC=" to the next space
    auto pos = x.find("SRC=");
    if (pos != std::string::npos) {
        substr = x.substr(pos + 4, x.find(' ', pos) - (pos + 4));
        std::cout << "SRC: " << substr << std::endl;
    }

    // For DST
    pos = x.find("DST=");
    if (pos != std::string::npos) {
        substr = x.substr(pos + 4, x.find(' ', pos) - (pos + 4));
        std::cout << "DST: " << substr << std::endl;
    }

    // For LEN: the last field, so take everything up to the end of the string
    pos = x.find("LEN=");
    if (pos != std::string::npos) {
        substr = x.substr(pos + 4);
        std::cout << "LEN: " << substr << std::endl;
    }
}

However, in my case I need to do this very quickly, ideally in a single iteration.
Could you give me some advice?

Answer 1

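"Ideally in a single iteration": in practice, the speed of a program does not depend on the number of loops visible in your source code. Regular expressions in particular are a very good way to hide multiple nested loops.

Your solution is actually pretty good. It does not waste much time before finding "SRC", and it does not search further than necessary to retrieve the IP address. Sure, when searching for "SRC" it gets a false positive on the first "S" of a month name such as "Sep", but that is resolved by the very next comparison. If you know for sure that the first occurrence of "SRC" is somewhere around column 20, you might gain a tiny bit of speed by skipping those first 20 characters. (Check your logs; I can't tell.)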
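The same idea can be taken one step further: since SRC, DST and LEN always appear in that order, each search can resume where the previous one stopped instead of starting at index 0. The following is only a sketch under that assumption. The names next_field, parse_fields and FieldViews are made up for illustration, and the skip parameter is the "skip the first N characters" shortcut, which is only safe if your logs really guarantee that SRC never starts earlier.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from the original post): finds `key` in `line`
// starting at `pos` and returns the text up to the next space or end of line.
// On success `pos` is advanced to the end of the value, so the next search
// resumes from there. Returns an empty view if the key is not found.
std::string_view next_field(std::string_view line, std::string_view key,
                            std::size_t& pos) {
    const std::size_t start = line.find(key, pos);
    if (start == std::string_view::npos) return {};
    const std::size_t value_begin = start + key.size();
    std::size_t value_end = line.find(' ', value_begin);
    if (value_end == std::string_view::npos) value_end = line.size();
    pos = value_end;
    return line.substr(value_begin, value_end - value_begin);
}

// Views into the original line; they are valid only while that line is alive.
struct FieldViews {
    std::string_view src, dst, len;
};

// Assumes the fields appear in the order SRC, DST, LEN.
// `skip` is an optional number of leading characters to ignore; a value
// past the end of the line is treated as 0.
FieldViews parse_fields(std::string_view line, std::size_t skip = 0) {
    std::size_t pos = skip < line.size() ? skip : 0;
    FieldViews f;
    f.src = next_field(line, "SRC=", pos);
    f.dst = next_field(line, "DST=", pos);
    f.len = next_field(line, "LEN=", pos);
    return f;
}
```

Calling parse_fields(line) on a std::string that stays alive gives three views into it without copying any characters. Whether this is measurably faster than the original three independent searches is something to verify on real log data.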

Answer 2

You can use std::regex, for example:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";

    // One capture group per field; \S+ matches a run of non-space characters
    std::regex const r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
    std::smatch matches;
    if (std::regex_search(x, matches, r)) {
        std::cout << "SRC " << matches.str(1) << '\n';
        std::cout << "DST " << matches.str(2) << '\n';
        std::cout << "LEN " << matches.str(3) << '\n';
    }
}

Note that matches.str(idx) creates a new string containing the match. With matches[idx] you can get iterators to the sub-string without creating a new string.
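To illustrate the matches[idx] route, here is a minimal sketch that turns each capture group into a std::string_view over the original string, so no per-field string is allocated. The names group_view, parse_with_regex and MatchViews are hypothetical and exist only for this example; the views are valid only as long as the searched string is alive and unmodified.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical helper: returns capture group `idx` as a view into `subject`
// without copying. `subject` must be the very string that was searched.
std::string_view group_view(const std::string& subject, const std::smatch& m,
                            std::size_t idx) {
    const std::ssub_match& g = m[idx];
    if (!g.matched) return {};
    // g.first and g.second are iterators into `subject`
    const auto offset = static_cast<std::size_t>(g.first - subject.begin());
    return std::string_view(subject.data() + offset,
                            static_cast<std::size_t>(g.length()));
}

struct MatchViews {
    std::string_view src, dst, len;
};

// Returns false if the line does not contain the three fields in order.
bool parse_with_regex(const std::string& line, MatchViews& out) {
    // Built once and reused for every line
    static const std::regex r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
    std::smatch m;
    if (!std::regex_search(line, m, r)) return false;
    out.src = group_view(line, m, 1);
    out.dst = group_view(line, m, 2);
    out.len = group_view(line, m, 3);
    return true;
}
```

This only avoids the copies made by matches.str(idx); the cost of the regex search itself is unchanged.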

Answer 3

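If your format is fixed and verified (that is, you can accept undefined behavior as soon as the input string does not contain exactly the expected characters), then you might squeeze out some performance by writing larger parts by hand and skipping the string termination tests that are part of all the standard functions.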

#include <cstdlib>  // std::strtol

// buf_ptr will be updated to point to the first character after the " SRC=x.x.x.x" sequence
unsigned long GetSRC(const char*& buf_ptr)
{
    // Don't search like this unless you have a trusted input format that's guaranteed to contain " SRC="!!!
    while (*buf_ptr != ' ' ||
        *(buf_ptr + 1) != 'S' ||
        *(buf_ptr + 2) != 'R' ||
        *(buf_ptr + 3) != 'C' ||
        *(buf_ptr + 4) != '=') 
    {
        ++buf_ptr;
    }
    buf_ptr += 5;
    char* next;

    long part = std::strtol(buf_ptr, &next, 10);
    // part is now the first number of the IP. Depending on your requirements you may want to extract the string instead
    unsigned long result = (unsigned long)part << 24;

    // Don't use 'next + 1' like this unless you have a trusted input format!!!
    part = std::strtol(next + 1, &next, 10);
    // part is now the second number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 16;

    part = std::strtol(next + 1, &next, 10);
    // part is now the third number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 8;

    part = std::strtol(next + 1, &next, 10);
    // part is now the fourth number of the IP. Depending on your requirements ...
    result |= (unsigned long)part;

    // update the buf_ptr so searching for the next information ( DST=x.x.x.x) starts at the end of the currently parsed parts
    buf_ptr = next;
    return result;
}

Usage:

const char* x_str = x.c_str();
unsigned long srcIP = GetSRC(x_str);
// now x_str will point to " DST=15.15.15.15 LEN=255" for further processing

std::cout << "SRC=" << (srcIP >> 24) << "." << ((srcIP >> 16) & 0xff) << "." << ((srcIP >> 8) & 0xff) << "." << (srcIP & 0xff) << std::endl;

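Note that I decided to store the whole extracted source IP in a single 32-bit unsigned value. You can choose a completely different storage model if you prefer.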

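Even if you cannot be that certain about your format, working with a pointer that is updated whenever a part has been processed, and continuing with the remaining string instead of starting again at 0, might be a good idea to improve performance.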
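For the case where the input cannot be fully trusted, the moving-cursor idea can be combined with bounds checks. The following is a sketch only, not the method from the answer above: take_ipv4 is a hypothetical name, and it swaps std::strtol for std::from_chars (C++17, from the charconv header), which takes an explicit end pointer and therefore never reads past the buffer.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical bounds-checked counterpart to GetSRC: looks for `key`
// (for example " SRC=") in `rest` and parses a dotted IPv4 address after it.
// On success `rest` is shrunk so that it starts right after the address.
// On failure `rest` is left untouched and std::nullopt is returned.
std::optional<std::uint32_t> take_ipv4(std::string_view& rest,
                                       std::string_view key) {
    const std::size_t at = rest.find(key);
    if (at == std::string_view::npos) return std::nullopt;

    const char* p = rest.data() + at + key.size();
    const char* const end = rest.data() + rest.size();
    std::uint32_t ip = 0;
    for (int i = 0; i < 4; ++i) {
        if (i > 0) {
            // Octets must be separated by a single '.'
            if (p == end || *p != '.') return std::nullopt;
            ++p;
        }
        unsigned octet = 0;
        const auto [next, ec] = std::from_chars(p, end, octet);
        if (ec != std::errc{} || octet > 255) return std::nullopt;
        ip = (ip << 8) | octet;
        p = next;
    }
    // Move the cursor: the caller continues with the unparsed remainder
    rest.remove_prefix(static_cast<std::size_t>(p - rest.data()));
    return ip;
}
```

Calling take_ipv4(rest, " SRC=") followed by take_ipv4(rest, " DST=") walks the line once from left to right, the same way GetSRC advances buf_ptr. The extra checks cost a few comparisons per field, so measure before assuming either variant is faster.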

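Of course, I assume your std::cout << ... lines are only there for development testing, because otherwise all of this micro-optimization becomes pointless anyway.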

3 answers

Answer 1
1 2017-12-11 12:52:54

Answer 2
1 2017-12-11 12:59:30

Answer 3
1 accepted 2017-12-11 13:27:14
