
Different treatment of '\' in the regex_replace() replacement string in g++ and boost

After upgrading gcc from version 4.8.5 to version 5.3.1, I thought I could get rid of boost's regex implementation (boost version 1.54.0) and use the one provided by gcc (AFAIK it did not work with gcc before version 4.9). However, this turned out to be a problem because the two implementations behave differently:

#include <regex>
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main() {
    std::string s="\\needs_another_backslash";
    std::string reg("^(\\\\)(needs)(.+)");
    std::string rep("\\\\got$3");
    std::regex sr(reg);
    boost::regex br(reg);
    std::cout<<"string before replacement:\n"<<s<<std::endl<<
        "regular expression:\n"<<reg<<std::endl<<
        "replacement string:\n"<<rep<<std::endl<<
        "std::regex_replace:\n"<<std::regex_replace(s,sr,rep)<<std::endl<<
        "boost::regex_replace:\n"<<boost::regex_replace(s,br,rep)<<std::endl;
    return 0;
}

This gives the following output:

string before replacement:
\needs_another_backslash
regular expression:
^(\\)(needs)(.+)
replacement string:
\\got$3
std::regex_replace:
\\got_another_backslash
boost::regex_replace:
\got_another_backslash

It seems as if boost treats the '\' in a replacement string specially, whereas gcc does not. Since the magic character for backreferences in the replacement string of std::regex_replace is '$' (as it also is in boost, as the example proves), I tend to think that gcc is right. However, in many other programs (vim, for example) it is '\'. Therefore, boost might have a point in treating '\' specially. So who is right?

First, the std example is actually not a matter of gcc but of the C++ standard, with which gcc (in this case) complies. The standard states in 28.5.2:

When a regular expression match is to be replaced by a new string, the new string shall be constructed using the rules used by the ECMAScript replace function in ECMA-262, part 15.5.4.11 String.prototype.replace. In addition, during search and replace operations all non-overlapping occurrences of the regular expression shall be located and replaced, and sections of the input that did not match the expression shall be copied unchanged to the output string.

And ECMA states:

Otherwise, let newstring denote the result of converting replaceValue to a String. The result is a String value derived from the original input String by replacing each matched substring with a String derived from newstring by replacing characters in newstring by replacement text as specified in Table 22. These $ replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements. For example, "$1,$2".replace(/(\$(\d))/g, "$$1-$1$2") returns "$1-$11,$1-$22". A $ in newstring that does not match any of the forms below is left as is.

(The "if" part that precedes this "otherwise" covers the case where replaceValue is a function.)

Nothing is mentioned about escape sequences being replaced. I tried it out with Firefox:

var test = "\\needs_another_backslash";
test = test.replace(/^(\\)(needs)(.+)/, "\\\\got$3");
alert(test);

Result: \\got_another_backslash.

The boost documentation states:

Effects: If fmt is either a null-terminated string, or a container of char_type's, then copies the character sequence [fmt.begin(), fmt.end()) to OutputIterator out. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to which it refers. The bitmasks specified in flags determines what format specifiers or escape sequences are recognized, by default this is the format used by ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace.

Additionally, it states for match_flag_type:

Specifies that when a regular expression match is to be replaced by a new string, that the new string is constructed using the rules used by the ECMAScript replace function in ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace. (FWD.1).

This is functionally identical to the Perl format string rules.

[...]

I tried it with perl 5.18.2 on Linux:

my $test = "\\needs_another_backslash";
$test =~ s/^(\\)(needs)(.+)/\\\\got$3/;
print "$test\n";

This resulted in \\got_another_backslash.

With std::string reg("^(\\\\)(needs)(.+)");, since a string literal is passed, reg holds the string ^(\\)(needs)(.+) (not a literal, so I left out the quotes!), and with std::string rep("\\\\got$3");, rep holds \\got$3.

But there is obviously a difference in interpretation. Assume that std and boost both used one and the same ECMAScript engine.

Then what std and boost still do consistently is compile the reg string as a regular expression:

sprintf(b, "/%s/", reg);
sr /* br, respectively */ = ECMAScriptEngine::compileFromSource(b);

I think this is reflected quite nicely by creating an instance of the std::regex or boost::regex class.

Then comes the difference, however: std passes s, sr and rep to the ECMAScript engine such that it directly calls s.(String.prototype.replace)(sr, rep); (of course there is no such function for s in reality; let's just assume we could do it this way).

boost compiles the rep string, too (side note: I haven't installed boost, so I did not verify this behaviour myself...):

sprintf(b, "'%s'", rep); // note: '', not //!
ecma_rep = ECMAScriptEngine::compileFromSource(b);

and then makes the engine call s.(String.prototype.replace)(sr, ecma_rep);.

Interestingly, boost does not compile the source string s, where it again agrees with std...

In the end, however, I think the standard implementation more closely reflects what we actually want to do:

s.replace(regex, string);
s.replace(/reg/, rep);
(std::string).replace(std::regex(std::string), std::string);
std::regex_replace(s, std::regex(reg), rep);

vs

s.replace(regex, string);
s.replace(/reg/, "rep");
(std::string).replace(boost::regex(std::string), boost::???(std::string));
boost::regex_replace(s, boost::regex(reg), rep); // not boost::???(rep)!

I'm not sure, however, whether this is sufficient to say that one is right and the other wrong (which would mean that the wrong one is buggy!). Possibly we even have to settle for a third option: both approaches are valid (so both are right and neither is wrong) and, unfortunately, they are incompatible...
