Backslash in regular expression bracket expression

Given the regular expression "[\\^]" should it match the strings "\\" and "^"?

My reading of the relevant C++, POSIX, and ECMAScript standards is that for the POSIX (basic, extended, awk, grep, and egrep) syntaxes, the regex should match both strings, and for the ECMAScript syntax only the second string should be matched.

The POSIX references for EREs and for the awk, grep, and egrep utilities all defer to the BRE specification (XBD 9.3.5/1), which says explicitly: "The special characters '.', '*', '[', and '\\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression." I interpret that to mean that a backslash is just a backslash once inside a bracket expression.
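One way to check this reading against a POSIX implementation directly, independent of std::regex, is to call the C library's regcomp/regexec from <regex.h>. The following is only a minimal sketch: the helper name posix_search is made up for illustration, and it assumes a platform that ships the POSIX <regex.h> header.

```cpp
#include <regex.h>   // POSIX regcomp/regexec/regfree (not the C++ <regex> header)
#include <string>

// Hypothetical helper (not part of any library): returns true if `text`
// contains a match for `pattern`. Pass cflags = 0 for a BRE or
// cflags = REG_EXTENDED for an ERE.
bool posix_search(const std::string& pattern, const std::string& text, int cflags)
{
    regex_t re;
    // REG_NOSUB: only a yes/no answer is needed, no submatch positions.
    if (regcomp(&re, pattern.c_str(), cflags | REG_NOSUB) != 0) {
        return false;  // the pattern did not compile
    }
    const bool found = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return found;
}
```

Because regexec searches rather than full-matches, the bracket expression has to be anchored, for example posix_search("^[\\^]$", "\\", REG_EXTENDED) and posix_search("^[\\^]$", "^", REG_EXTENDED). If the backslash really is an ordinary character inside the brackets, both calls should return true.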

The ECMAScript specification does not have the 'lose its special meaning' rule but instead specifies that a backslash followed by a non-alphanumeric character is just the character itself.
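To make the ECMAScript rule concrete, here is a small sketch that full-matches a subject against a pattern compiled with the ECMAScript grammar. The helper name ecma_full_match is hypothetical, and what it returns for a given input depends on the standard library in use.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the standard library): full-match `text`
// against `pattern` using the ECMAScript grammar, the std::regex default.
bool ecma_full_match(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex_constants::ECMAScript);
    return std::regex_match(text, re);
}
```

Under the reading above, ecma_full_match("[\\^]", "^") should be true and ecma_full_match("[\\^]", "\\") should be false, because the class contains only the escaped caret. Writing the class with an escaped backslash, as in the C++ literal "[\\\\^]", would be the way to include both characters.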

The GCC standard library (libstdc++) matches neither string, regardless of the regex syntax chosen. The LLVM standard library (libc++) matches the way I expect with the ECMAScript syntax but raises an exception when constructing the regex with any other syntax ("invalid escaped character").

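Here is a test program. It compiles "[\\^]*" under the awk and ECMAScript syntaxes and tries a full match against "\\^".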

#include <iostream>
#include <regex>
#include <string>

void
do_match(std::string const& label, std::regex_constants::syntax_option_type type)
{
    try {
        std::regex re("[\\^]*", type);
        std::cmatch m;
        if (std::regex_match("\\^", m, re)) {
            for (auto res: m) {
                std::cerr << label << " match: " << res << "\n";
            }
        } else {
            std::cerr << label << " no match\n";
        }
    } catch (std::regex_error const& ex) {
        std::cerr << "caught exception: " << ex.what() << "\n";
    }
}

int
main()
{
    do_match("awk", std::regex_constants::awk);
    do_match("ecma", std::regex_constants::ECMAScript);
}

Are my expectations wrong, and if not, which standard library implementation is correct?

Given the regular expression "[\\^]" should it match the strings "\\" and "^"?

Using the std::regex_constants syntax options:

  1. ECMAScript, awk - No, it will not match. The \\^ escapes the ^, so [\\^] is interpreted as [^] (the "removal of escape characters", i.e. substituting ^ for \\^, comes before the [ set is parsed). The ^ is then the first character after the [ bracket, so it is interpreted as "negation" (my own term for it), and the bracket matches anything except the characters in the list. As the list is empty, [^<this list here>], it would match anything except an empty list... Well, it will match nothing.

  2. basic, grep, extended, egrep - it will match both strings. The \\ loses its escaping meaning inside the [ . So [\\^] will literally match \\ or ^ .
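Since the two standard libraries disagree, the two cases above are easiest to check by probing every grammar on the implementation at hand. The sketch below does that. The names Outcome, probe, to_text, and report are invented for this example, and no output is shown because it varies by library and version.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Three possible results for one grammar/subject pair.
enum class Outcome { Match, NoMatch, Rejected };

// Compile the regex [\^] under `syntax` and full-match `subject` against it.
// Rejected means the std::regex constructor threw std::regex_error.
Outcome probe(std::regex_constants::syntax_option_type syntax,
              const std::string& subject)
{
    try {
        const std::regex re("[\\^]", syntax);
        return std::regex_match(subject, re) ? Outcome::Match : Outcome::NoMatch;
    } catch (const std::regex_error&) {
        return Outcome::Rejected;
    }
}

const char* to_text(Outcome o)
{
    switch (o) {
        case Outcome::Match:   return "match";
        case Outcome::NoMatch: return "no match";
        default:               return "pattern rejected";
    }
}

// Print one line per grammar, covering both subject strings.
void report(std::ostream& out)
{
    struct Grammar {
        const char* name;
        std::regex_constants::syntax_option_type flag;
    };
    const Grammar grammars[] = {
        {"ECMAScript", std::regex_constants::ECMAScript},
        {"basic",      std::regex_constants::basic},
        {"extended",   std::regex_constants::extended},
        {"awk",        std::regex_constants::awk},
        {"grep",       std::regex_constants::grep},
        {"egrep",      std::regex_constants::egrep},
    };
    for (const auto& g : grammars) {
        out << g.name
            << ": backslash -> " << to_text(probe(g.flag, "\\"))
            << ", caret -> "     << to_text(probe(g.flag, "^")) << '\n';
    }
}
```

Calling report(std::cout) from main prints one line per grammar. Catching std::regex_error inside probe keeps one rejected pattern from aborting the whole run.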
