
Problem with special characters with RegEx in C++

I have a problem replacing special characters in a string (from IIS SharePoint log files). The string contains a domain name followed by a backslash and a user name that starts with t, n, or r, which causes confusion in my regular-expression code. My code is as follows:

std::setlocale(LC_ALL, ".ACP"); //Sets the locale to the ANSI code page obtained from the operating system. FR characters
std::string subject("2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984");
std::string result;
std::string g1, g2, g5, g9, g10; //str groups in regex

try {
    std::regex re("(\\d{4}-\\d{2}-\\d{2})( \\d{2}:\\d{2}:\\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \\d+.\\d+.\\d+.\\d+)");
    std::sregex_iterator next(subject.begin(), subject.end(), re);
    std::sregex_iterator end;

    while (next != end) {
        std::smatch match = *next;
        std::cout << match.str() << "\n";
        std::cout << "-------------------------------------------" << "\n";
        g1 = match.str(1);
        g2 = match.str(2);
        g5 = match.str(5);
        g9 = match.str(9);
        g10 = match.str(10);
        next++;
    }

    std::cout << "Date: " + g1 << "\n";
    std::cout << "Time: " + g2 << "\n";
    std::replace(g5.begin(), g5.end(), '+', ' ');
    std::cout << "Link Document : " + g5 << "\n";
    std::cout << "User: " + g9 << "\n";
    std::cout << "IP: " + g10 << "\n";
}
catch (const std::regex_error& e) {
    std::cout << "Syntax error in the regular expression" << "\n";
}

My output for the domain name is: domainname onzaro

Can anyone help with this problem with \, \t, \n, or \r?

I'd urge you to use raw string literals. They are designed for cases where the literal should not be processed in any way, such as yours.

The syntax is R"delimiter(raw_characters)delimiter", so in your case it could be:

std::string subject(R"raw(2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984)raw");
std::regex re( R"raw((\d{4}-\d{2}-\d{2})( \d{2}:\d{2}:\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \d+.\d+.\d+.\d+))raw");

(I might have missed some superfluous \ above.)

Those special characters are called escape sequences, and they are processed in string literals at compile time (in translation phase 5, to be precise). For raw string literals this transformation is suppressed.

With a raw literal you don't have to care about any special character handling. You just need to make sure that )delimiter" doesn't appear in your literal, which I imagine could happen in a regex.

'\t' is one character, a horizontal tab. If you want the two characters \ and t, you need to escape the backslash: "\\t".
