
Regex matches under g++ 4.9 but fails under g++-5.3.1

I am tokenizing a string with a regex; this works normally under g++-4.9, but fails under g++-5.3.1.

I have the following txt file:

0001-SCAND ==> "Scandaroon" (from Philjumba)
0002-KINVIN ==> "King's Vineyard" (from Philjumba)
0003-HANNI ==> "Hannibal: Rome vs. Carthage" (from Philjumba)
0004-LOX ==> "Lords of Xidit" (from Philjumba)

which I am tokenizing using regular expressions: by spaces, by pairs of quotation marks and by pairs of parentheses. For example, the first line should be tokenized as follows:

0001-SCAND
==>
"Scandaroon"
(from Philjumba)

I have written the following std::regex:

std::regex FPAT("(\\S+)|(\"[^\"]*\")|(\\([^\\)]+\\))");

And I am tokenizing the string with:

std::vector<std::string>
split( const std::string & input, const std::regex & regex ) {

        std::sregex_token_iterator
                first{input.begin(), input.end(), regex, 0},
                last;

        return {first, last};
}

This returns the matches. Under g++-4.9 the string is tokenized as requested, but under g++-5.3.1 it's tokenized as follows:

0001-SCAND
==>
"Scandaroon"
(from
Philjumba)

or the third line is tokenized as follows:

0003-HANNI
==>
"Hannibal:
Rome
vs.
Carthage"
(from
Philjumba)

What could the issue be?


edit: I am calling the function as follows:

std::string line("0001-SCAND ==> \"Scandaroon\" (from Philjumba)");
auto elems = split( line, FPAT );

edit: following feedback from @xaxxon, I changed the function to fill and return a vector instead of returning the iterator pair directly, but it still does not work correctly under g++-5.3.

std::vector<std::string>
split( const std::string & input, const std::regex & regex ) {

        std::sregex_token_iterator
                first{input.begin(), input.end(), regex, 0},
                last;

        std::vector< std::string > elems;
        elems.reserve( std::distance(first,last) );

        for ( auto it = first; it != last; ++ it ) {
                //std::cout << (*it) << std::endl;
                elems.push_back( *it );
        }

        return elems;
}

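Regular expression alternation is eager: the alternatives are tried from left to right, and the first one that matches wins.

So for the regular expression "Set|SetValue" and the text "SetValue", the regex finds "Set".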
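
A minimal sketch of that behaviour, assuming the default ECMAScript grammar of std::regex. The helper name first_match is made up for this illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of `pattern` in `text`,
// or an empty string if nothing matches.
// std::regex uses the ECMAScript grammar when no flag is given.
std::string first_match(const std::string & pattern, const std::string & text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

With a conforming implementation, first_match("Set|SetValue", "SetValue") gives "Set", while swapping the alternatives, first_match("SetValue|Set", "SetValue"), gives "SetValue".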

You have to choose the order of the alternatives carefully:

std::regex FPAT(R"(("[^"]*")|(\([^)]+\))|(\S+))");

\S+ goes at the end so that it is tried last, only after the quoted and parenthesized alternatives have failed to match.
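
As a sketch of how the reordered pattern could be plugged into the question's split function (the name tokenize is hypothetical, and the pattern is written without the redundant escapes):

```cpp
#include <regex>
#include <string>
#include <vector>

// Sketch: the quoted and parenthesized alternatives come before the
// catch-all \S+, so ECMAScript's ordered alternation tries them first.
std::vector<std::string> tokenize(const std::string & input) {
    static const std::regex fpat(R"(("[^"]*")|(\([^)]+\))|(\S+))");

    // Submatch index 0 selects the whole match for each token.
    std::sregex_token_iterator first{input.begin(), input.end(), fpat, 0};
    std::sregex_token_iterator last;

    // Each sub_match is copied into its own std::string, so the result
    // does not depend on the lifetime of `input`.
    return std::vector<std::string>(first, last);
}
```

Called on the third line of the input file, this should yield four tokens: 0003-HANNI, ==>, "Hannibal: Rome vs. Carthage" and (from Philjumba).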

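Another alternative is to not use the default grammar (see http://en.cppreference.com/w/cpp/regex/syntax_option_type ) and use std::regex::extended instead, where the longest matching alternative is preferred regardless of order. Note that \S is an ECMAScript escape and is not part of the POSIX extended syntax, so it is spelled [^[:space:]] here:

std::regex FPAT(R"(([^[:space:]]+)|("[^"]*")|(\([^)]+\)))", std::regex::extended);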
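
A sketch of the same tokenizer built on the POSIX extended grammar, assuming the implementation follows the POSIX leftmost-longest rule for alternation. The name tokenize_extended is hypothetical, and the pattern avoids \S and escaped quotes because escaping ordinary characters is not defined in POSIX extended syntax:

```cpp
#include <regex>
#include <string>
#include <vector>

// Sketch: with std::regex::extended the longest alternative is expected to
// win, so the catch-all alternative can stay first in the pattern.
std::vector<std::string> tokenize_extended(const std::string & input) {
    static const std::regex fpat(
        R"(([^[:space:]]+)|("[^"]*")|(\([^)]+\)))",
        std::regex::extended);

    std::sregex_token_iterator first{input.begin(), input.end(), fpat, 0};
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

The trade-off is that the extended grammar lacks ECMAScript conveniences such as \S, \d and non-greedy quantifiers, so reordering the alternatives is often the smaller change.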

So it seems that g++-5.3.1 has fixed a bug that was present in g++-4.9 in this regard; the output you get under 5.3.1 is the correct one for your original pattern.

You haven't posted enough for me to know for sure, but if you're doing what I did, you forgot that the iterators point into the source string, and that string is no longer valid by the time you use them. (You have since updated the question to show that you are calling it with an lvalue, so this answer probably doesn't apply, but I'll leave it up unless people want me to take it down.)

You could remove the const from input so that temporaries can no longer bind to it, but it's so damn convenient to be able to pass an rvalue there, so...

Here's what I do to avoid this - I return a unique_ptr to something that looks like the results, but I hide the actual source string along with it so the string can't go away before I'm done using it. This is likely UB, but I think it will work virtually all the time:

// Holds a regex match as well as the original source string so the matches remain valid as long as the 
// caller holds on to this object - but it acts just like a std::smatch
struct MagicSmatch {
    std::smatch match;
    std::string data;

    // constructor makes a copy of the string and associates
    // the copy's lifetime with the iterators into the string (the smatch)
    MagicSmatch(const std::string & data) : data(data)
    {}
};

// this deleter knows about the hidden string and makes sure to delete it
// this cast is probably UB because std::smatch isn't a standard layout type
struct MagicSmatchDeleter {
    void operator()(std::smatch * smatch) {
        delete reinterpret_cast<MagicSmatch *>(smatch);
    }
};


// the caller just thinks they're getting a smatch ptr.. but we know the secret
std::unique_ptr<std::smatch, MagicSmatchDeleter> regexer(const std::regex & regex, const std::string & source)
{
    auto magic_smatch = new MagicSmatch(source);
    std::regex_search(magic_smatch->data, magic_smatch->match, regex);
    return std::unique_ptr<std::smatch, MagicSmatchDeleter>(reinterpret_cast<std::smatch *>(magic_smatch));

}

As long as you call it as auto results = regexer(....), it's quite easy to use, though results is a pointer, not a proper smatch, so the [] syntax doesn't work as nicely: you have to write (*results)[1] instead of results[1].
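
For comparison, here is a sketch of a variant that keeps the same idea (the string copy lives as long as the match) without the reinterpret_cast, at the cost of the caller seeing a wrapper type rather than a bare smatch. The names OwnedMatch and owned_search are hypothetical and not part of the code above:

```cpp
#include <memory>
#include <regex>
#include <string>

// Hypothetical wrapper: owns a copy of the searched string next to the
// std::smatch whose iterators point into that copy.
struct OwnedMatch {
    std::string data;
    std::smatch match;

    explicit OwnedMatch(const std::string & source) : data(source) {}

    // Copying would leave `match` pointing into another object's string,
    // so the object is only ever handled through a pointer.
    OwnedMatch(const OwnedMatch &) = delete;
    OwnedMatch & operator=(const OwnedMatch &) = delete;
};

// Searches the owned copy, so the sub-matches stay valid for as long as
// the returned object is alive, even if `source` was a temporary.
std::unique_ptr<OwnedMatch> owned_search(const std::regex & regex,
                                         const std::string & source) {
    std::unique_ptr<OwnedMatch> result(new OwnedMatch(source));
    std::regex_search(result->data, result->match, regex);
    return result;
}
```

A caller would then write something like auto r = owned_search(re, some_string); and read the groups as r->match[1].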
