
C++ Use Regex to find substring

I have a test string:

<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>

I want to find <a href="4.%20Functions,%20scope.ppt"> (as a substring).

From a Google search, I found regex e ("<a href=.*?>"); cmatch cm; to mark the substring that I want to find.

What should I do next?

Am I right to use regex_match(htmlString, cm, e); with htmlString as wchar_t*?

If you want to find all the matching substrings then you need to use the regex iterators:

#include <iostream>
#include <regex>
#include <string>

// example data
std::wstring const html = LR"(

<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>
<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>
<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>

)";

// for convenience
constexpr auto fast_n_loose = std::regex_constants::optimize|std::regex_constants::icase;

// extract href's
std::wregex const e_link{LR"~(href=(["'])(.*?)\1)~", fast_n_loose};

int main()
{
    // regex iterators       
    std::wsregex_iterator itr_end;
    std::wsregex_iterator itr{std::begin(html), std::end(html), e_link};

    // iterate through the matches
    for(; itr != itr_end; ++itr)
    {
        std::wcout << itr->str(2) << L'\n';
    }
}
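
For the single-match case in the question, note that std::regex_match only succeeds when the pattern matches the entire input, whereas std::regex_search finds the pattern anywhere in it. A wchar_t* input also needs the wide types std::wregex and std::wcmatch rather than regex and cmatch. A minimal sketch under those assumptions (the helper name first_anchor_open_tag is made up for illustration and is not part of the original answer):

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): returns the first
// substring matching <a href=...>, or an empty string if there is none.
std::wstring first_anchor_open_tag(wchar_t const* html)
{
    // wide-character input needs std::wregex and std::wcmatch
    static std::wregex const e{LR"(<a href=.*?>)"};
    std::wcmatch cm;

    // regex_search looks for a match anywhere in the input;
    // regex_match would require the whole string to match the pattern
    if (std::regex_search(html, cm, e))
        return cm.str(0);

    return {};
}
```

Called with the test string from the question, this should return <a href="4.%20Functions,%20scope.ppt">.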

This will match the complete a tag and also get the href attribute value, which is in capture group 2.

It should be done this way because the href attribute can be anywhere in the tag.

<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\\shref\\s*=\\s*(?:(['"])([\\S\\s]*?)\\1))\\s+(?:"[\\S\\s]*?"|'[\\S\\s]*?'|[^>]*?)+>

You can substitute [\\w:]+ in place of the a tag to get the href from all tags.

https://regex101.com/r/LHZXUM/1

Formatted and tested:

 < a                    # a tag, substitute [\w:]+ for any tag

 (?=                    # Assertion (a pseudo atomic group)
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s href \s* = \s* 
      (?:
           ( ['"] )               # (1), Quote
           ( [\S\s]*? )           # (2), href value
           \1 
      )
 )
 \s+ 
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >
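
As a rough sketch of how this pattern might be used from C++: the code below assumes the default ECMAScript grammar of std::regex and narrow strings for brevity, and writes the pattern as a raw string literal so the backslashes do not need doubling. The helper name anchor_hrefs is hypothetical. Patterns with nested quantifiers like this one can backtrack heavily on malformed input, so treat it as an illustration rather than a robust HTML parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): collects the href
// value of every <a ...> tag, wherever href sits among the attributes.
std::vector<std::string> anchor_hrefs(std::string const& html)
{
    // same pattern as above, in a raw string so backslashes stay single
    static std::regex const e_tag{
        R"~(<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])([\S\s]*?)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>)~"};

    std::vector<std::string> hrefs;
    std::sregex_iterator itr_end;
    for (std::sregex_iterator itr{html.begin(), html.end(), e_tag}; itr != itr_end; ++itr)
    {
        // capture group 2 is set inside the lookahead and holds the href value
        hrefs.push_back(itr->str(2));
    }
    return hrefs;
}
```

Here itr->str(0) would be the whole opening tag, while itr->str(2) is only the attribute value.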
