Problem with special characters with RegEx in C++
I have a problem replacing special characters in a string (from IIS SharePoint log files) that contains a domain name followed by a backslash and user names starting with t, n, or r, which gets confused with regular expressions. My code is as follows:
std::setlocale(LC_ALL, ".ACP"); //Sets the locale to the ANSI code page obtained from the operating system. FR characters
std::string subject("2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984");
std::string result;
std::string g1, g2, g5, g9, g10; //str groups in regex
try {
std::regex re("(\\d{4}-\\d{2}-\\d{2})( \\d{2}:\\d{2}:\\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \\d+.\\d+.\\d+.\\d+)");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
std::cout << "-------------------------------------------" << "\n";
g1 = match.str(1);
g2 = match.str(2);
g5 = match.str(5);
g9 = match.str(9);
g10 = match.str(10);
next++;
}
std::cout << "Date: " + g1 << "\n";
std::cout << "Time: " + g2 << "\n";
std::replace(g5.begin(), g5.end(), '+', ' ');
std::cout << "Link Document : " + g5 << "\n";
std::cout << "User: " + g9 << "\n";
std::cout << "IP: " + g10 << "\n";
}
catch (std::regex_error& e) {
std::cout << "Syntax error in the regular expression" << "\n";
}
My output for the domain name is: domainname onzaro
Can anyone help with this problem with \, \t, \n, or \r?
I'd urge you to use raw string literals. They are designed for cases where the literal should not be processed in any way, such as yours.
The syntax is R"delimiter(raw_characters)delimiter", so in your case it could be:
std::string subject(R"raw(2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984)raw");
std::regex re( R"raw((\d{4}-\d{2}-\d{2})( \d{2}:\d{2}:\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \d+.\d+.\d+.\d+))raw");
(I might have missed some superfluous \ above.)
Those special characters are called escape sequences, and they are processed in string literals at compile time (in translation phase 5, to be precise). For raw string literals this transformation is suppressed.
With a raw literal you don't need to care about any special character handling. You just need to make sure that the closing sequence )delimiter" doesn't appear inside your literal, which I imagine could happen in a regex.
'\t' is one character, a horizontal tab. If you want the characters \ and t, you need to escape the backslash: "\\t".