
std::regex and ignoring flags

After learning the basic rules, I focused on std::regex and created two console apps: 1. renrem and 2. bfind .
I also decided to write some convenience functions that make regex as easy to use as possible, relying only on std ; I named the set RFC ( = regex function collection ).

There are several strange things that always surprise me, but this one ruined all my attempts and those two console apps.

One of the important functions is count_match , which counts the number of matches inside a string. Here is the full code:

unsigned int count_match( const std::string& user_string, const std::string& user_pattern, const std::string& flags = "o" ){

    const bool flags_has_i = flags.find( "i" ) < flags.size();
    const bool flags_has_g = flags.find( "g" ) < flags.size();

    std::regex::flag_type regex_flag                  = flags_has_i ? std::regex_constants::icase         : std::regex_constants::ECMAScript;
//    std::regex_constants::match_flag_type search_flag = flags_has_g ? std::regex_constants::match_default : std::regex_constants::format_first_only;
    std::regex rx( user_pattern, regex_flag );
    std::match_results< std::string::const_iterator > mr;

    unsigned int counter = 0;
    std::string temp = user_string;
    while( std::regex_search( temp, mr, rx ) ){
        temp = mr.suffix().str();
        ++counter;
    }

    if( flags_has_g ){
        return counter;
    } else {
        if( counter >= 1 ) return 1;
        else               return 0;
    }

}  

First of all, as you can see, the line for search_flag is commented out because std::regex_search ignores it, and I do not know why, since the exact same flag is accepted by std::regex_replace . So std::regex_search ignores format_first_only but std::regex_replace accepts it. Let it go.

The main problem is that the icase flag is also ignored when the pattern is a character class -> [] . In fact, it happens when the pattern contains only capital letters or only small letters: [A-Z] or [a-z]

Suppose this string: s = "ONE TWO THREE four five six seven"

the output with std is:

std::cout << count_match( s, "[A-Z]+" ) << '\n';          // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n';     // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n';    // 3 => Global match plus insensitive  

whereas for the exact same code in other languages and with boost the output is:

std::cout << count_match( s, "[A-Z]+" ) << '\n';          // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n';     // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n';    // 7 => Global match plus insensitive  

I know about regex flavors such as PCRE, or ECMAScript 262, which C++ uses, but I have no idea why a simple flag is ignored by the only search function that C++ has, since std::regex_iterator and std::regex_token_iterator also use this function internally.

In short, I cannot use my two apps and RFC with the std library because of this!

So if someone knows which rule causes this (maybe it is a valid rule in ECMAScript 262), or if I am wrong anywhere, please tell me. Thanks.


tested with

gcc version 6.3.0 20170519 (Ubuntu/Linaro 6.3.0-18ubuntu2~16.04)
clang version 3.8.0-2ubuntu4  

Perl code:

perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/g; print $c ;' "ONE TWO THREE four five six seven" // 3
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/gi; print $c ;' "ONE TWO THREE four five six seven" // 7  

D code:

uint count_match( ref const (char[]) user_string, const (char[]) user_pattern, const (char[]) flags ){

    const bool flag_has_g = flags.indexOf( "g" ) != -1;

    Regex!( char ) rx = regex( user_pattern, flags );
    uint counter = 0;
    foreach( mr; matchAll( user_string, rx ) ){
        ++counter;
    }

    if( flag_has_g ){
        return counter;
    } else {
        if( counter >= 1 ) return 1;
        else               return 0;
    }
} 

the output:

writeln( count_match( s, "[A-Z]+", "g" ) );  // 3
writeln( count_match( s, "[A-Z]+", "gi" ) ); // 7  

JavaScript code:

var s = "ONE TWO THREE four five six seven";
var rx1 = new RegExp( "[A-Z]+" , "g" );
var rx2 = new RegExp( "[A-Z]+" , "gi" );
var counter = 0;
while( rx1.exec( s ) ){ ++counter; }
document.write( counter + "<br>" ); // 3
counter = 0;
while( rx2.exec( s ) ){ ++counter; }
document.write( counter ); // 7


Okay. After testing with gcc 7.1.0, it turned out that with version 6.3.0 and below the output is 1 3 3 , but with 7.1.0 the output is 1 3 7 ; here is the link .

Also, with this version of clang the output is correct. Here is the link . Thanks to user igor-tandetnik.

At first I thought this might be a rule of ECMAScript , but after testing the code and seeing Igor Tandetnik's comment, I tested the code with gcc 7.1.0 and it outputs the correct result.

To test the regex library, I use:

std::cout << ( ( rx.flags() & std::regex_constants::icase ) == std::regex_constants::icase ? "yes" : "no" ) << '\n';

So when icase is set the condition is true, otherwise it is false. So I think the flag itself is stored correctly by the library. Here is the test with gcc 7.1.0
Therefore all versions below gcc 7.1.0 have incorrect output.

For clang I have no idea, since I have clang 3.8.0 and its output is incorrect, but with the online compiler even version 3.7.1 gives the correct output.

screenshot with clang 3.8.0 for this code:

std::cout << count_match( s, "[A-Z]+" ) << '\n';          // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n';     // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n';    // 7 => Global match plus insensitive

[image: screenshot]

So with the online compiler the output is incorrect for clang 3.2 and below, but higher versions output the correct result.

Please correct me if I am wrong

First of all, as you can see, the line for search_flag is commented out because std::regex_search ignores it, and I do not know why, since the exact same flag is accepted by std::regex_replace.

The flag in question is format_first_only . This flag makes sense only for a "replace" operation. In regex_replace , the default is "replace all" but if you pass this flag it becomes "replace first only."

In regex_match and regex_search , there is no replacement going on at all; both of those functions just find the first match (and in the case of regex_match , that match must consume the entire string). Since the flag is meaningless in that case, I would expect the implementation to ignore it; but I wouldn't fault the implementation for throwing an exception, either, if it chose to be noisy about it.


The main problem is that the icase flag is also ignored when the pattern is a character class -> []. In fact, it happens when the pattern contains only capital letters or only small letters: [A-Z] or [a-z]

icase working wrong for character classes is definitely a bug in your vendor's library.

  • Looks like libstdc++'s bug was fixed between GCC 6.3 (Dec 2016) and GCC 7.1 (May 2017).
  • Looks like libc++'s bug was fixed between Clang 3.2 (Dec 2012) and Clang 3.3 (Jun 2013).
