
C++ Boost regex vs Standard Library regex match results

I'm having difficulty getting Boost regex match results to come out the same way as the standard library's. Specifically, for a multi-line input that can produce multiple matches, the standard library returns only the first match.

The goal is the best possible performance, since the product that runs this code calls it very frequently. The substring calls are horrendously slow, which is why I am trying the Boost way of doing things.

This product is written in C++ from before C++11; it is old code that I can't upgrade.

Example below:

_pattern : [A-Za-z0-9].+\\n[ \\t]*\\n

Input string (the line feeds are essential):

CLINICAL: Left 2cm Firm Fibrous Lump @12:00.

No prior exams were available for comparison.

There is gynecomastia in both feet.

Standard Library version of code:

size_t
ORegExpr::index(const OString &inputStr, size_t* length, size_t start = 0) const {
if (start == O_NPOS)
    return O_NPOS;

std::smatch reMatch;    
std::regex re(_pattern);
std::string inputData = "";
if (start > 0 )
    inputData = inputStr._string.substr(start); 
else
    inputData = inputStr._string;

if(std::regex_search(inputData,reMatch,re))
{
  *length = reMatch.length();
  return reMatch.position(0) + start;   
}
*length = 0;
return O_NPOS;
}

**Boost version**

size_t
ORegExpr::index_boost(const OString &inputStr, size_t* length, size_t start = 0) const {
if (start == O_NPOS)
    return O_NPOS;  

boost::regex re(_pattern);

boost::match_results<std::string::const_iterator> what;
boost::match_flag_type flags = boost::match_default;    
std::string::const_iterator s = inputStr.std().begin() + start;    
std::string::const_iterator e = inputStr.std().end();

if(boost::regex_search(s,e,what,re,flags)){
    *length = what.length();        
    return what.position() + start;
}

*length = 0;
return O_NPOS;
}

**Boost replaced with std, to see whether using iterators would make a difference**

size_t
ORegExpr::index_boostnowstd(const OString &inputStr, size_t* length, size_t start = 0) const {
if (start == O_NPOS)
    return O_NPOS;  

std::regex re(_pattern);

std::match_results<std::string::const_iterator> what;
//boost::match_flag_type flags = boost::match_default;  
std::string::const_iterator s = inputStr.std().begin() + start;    
std::string::const_iterator e = inputStr.std().end();

if(std::regex_search(s,e,what,re)){
    *length = what.length();        
    return what.position() + start;
}

*length = 0;
return O_NPOS;
}

I tried every way I could think of to get the "array" of matches and return just the length of the first match, but I couldn't get this from Boost. It would return both matches together, with the total length of both of them, which covers the first and second lines of the input string.

I have a fully functional POC if my explanation isn't as clear as I think it is.

I expect the functions to report a size_t length of 46, which is the length of the first line of the input string. The standard library does this but Boost doesn't. The reason for using Boost is that it seems to run faster than the standard library.

Your regular expression is actually matching the first two lines, not the first one alone.
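
A likely reason for the difference, offered as an assumption rather than something the question confirms: in Boost's default Perl syntax `.` can also match a newline unless the `match_not_dot_newline` flag is passed, while in std::regex's default ECMAScript grammar `.` stops at a line terminator. The greedy `.+` would then run across the blank line only in Boost. The sketch below uses only std::regex and emulates a newline-matching dot with `[\s\S]`. The helper `first_match_length`, the two pattern constants, and the sample layout (paragraphs separated by one blank line, no trailing blank line) are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: length of the first match of `pattern` in `text`,
// or 0 when nothing matches.
std::size_t first_match_length(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const std::regex re(pattern);  // default grammar: ECMAScript
    std::smatch m;
    return std::regex_search(text, m, re) ? static_cast<std::size_t>(m.length(0)) : 0;
}

// Sample input from the question; the exact line-feed layout is assumed.
std::string sample_input() {
    return "CLINICAL: Left 2cm Firm Fibrous Lump @12:00.\n\n"
           "No prior exams were available for comparison.\n\n"
           "There is gynecomastia in both feet.";
}

// The question's pattern. In ECMAScript '.' does not match '\n',
// so ".+" cannot leave the first line.
const char* const kDotStopsAtNewline = "[A-Za-z0-9].+\\n[ \\t]*\\n";

// Same pattern with '.' replaced by [\s\S], which matches any character
// including '\n'. The greedy "+" then extends to the last blank line.
const char* const kDotCrossesNewline = "[A-Za-z0-9][\\s\\S]+\\n[ \\t]*\\n";
```

With this sample, `first_match_length(sample_input(), kDotStopsAtNewline)` should report 46, while the `kDotCrossesNewline` variant runs on to the end of the second paragraph. If this is indeed the cause, passing `boost::match_not_dot_newline` as `flags` in `index_boost` may be worth trying; that is untested here.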

Try this one instead:

"[^\\n]+\\n\\n"

Live Demo (C++03)

This regular expression matches the first run of one or more non-newline characters followed by two newline characters. That is the first line of your input, giving you a length of 46 (including the newline characters).
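
The suggested expression can be checked with a small sketch along these lines. It uses std::regex rather than Boost so that it stands alone, searches by iterator from `start` instead of copying with substr, and builds the regex once. The names `index_of_paragraph` and `kNotFound` (a stand-in for the question's `O_NPOS`) are hypothetical, and a plain `std::string` replaces `OString`.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for the question's O_NPOS.
const std::size_t kNotFound = static_cast<std::size_t>(-1);

// Sketch: find the first run of non-newline characters followed by two
// newlines, at or after `start`. Searching by iterator avoids the
// substring copy made in the question's standard library version.
std::size_t index_of_paragraph(const std::string& input, std::size_t* length,
                               std::size_t start = 0) {
    assert(length != nullptr);
    *length = 0;
    if (start == kNotFound || start > input.size())
        return kNotFound;

    // Built once and reused across calls instead of being rebuilt each time.
    static const std::regex re("[^\\n]+\\n\\n");

    std::match_results<std::string::const_iterator> what;
    if (!std::regex_search(input.begin() + start, input.end(), what, re))
        return kNotFound;

    *length = static_cast<std::size_t>(what.length(0));
    // position() is relative to the first iterator passed in, so add `start`.
    return static_cast<std::size_t>(what.position(0)) + start;
}
```

Called on the question's input with `start = 0`, this should return index 0 and set `*length` to 46. Note that this expression is looser than the original: it does not require the line to start with an alphanumeric character, and it does not allow spaces or tabs on the blank line.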


Edit: From your comments it appears you're stuck with the given expression.

What you can try is to use Boost's match_flag_type to alter how the regular expression is matched. In this case, use boost::match_any to return the leftmost match.

boost::match_flag_type flags = boost::match_any;

From the doc for match_any :

Specifies that if more than one match is possible then any match is an acceptable result: this will still find the leftmost match, but may not find the "best" match at that position. Use this flag if you care about the speed of matching, but don't care what was matched (only whether there is one or not).

Demo #2
