Regex skip in C++

This is my string:

/*
  Block1 {

    anythinghere
  }
*/

// Block2 { }
# Block3 { }

Block4 {

    anything here
}

I am using this regex to get each block's name and its inner content:

regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);

But this regex also matches the blocks inside the descriptions (comments). PHP has a “skip” option that you can use to skip all descriptions:

What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match

But this is C++, and I cannot use this skip method. What should I do to skip all descriptions and get only Block4 with a C++ regex?

This regex detects Block1, Block2, Block3 and Block4, but I want to skip Block1, Block2 and Block3 and get only Block4 (skip the descriptions). How do I have to edit my regex to get only Block4 (everything outside the descriptions)?

TL;DR: Regular expressions cannot be used to parse full-blown computer languages. What you want to do cannot be done with regular expressions. You need to develop a mini C++ parser to filter out comments. The answer to this related question might point you in the right direction.
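
As a rough illustration of what such a comment filter could look like, here is a minimal sketch of a hand-written state machine. The name `strip_comments` is hypothetical, and the sketch assumes only `/* ... */` and `//` comments need removing; it does not handle line continuations inside `//` comments or the `#` lines from the question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns `src` with /* ... */ and // comments replaced
// by a single space. Quoted literals are copied unchanged, so a "/*" inside
// quotes is not mistaken for the start of a comment.
std::string strip_comments(const std::string& src) {
    enum class State { Code, LineComment, BlockComment, Quoted };
    State state = State::Code;
    char quote = '\0';  // quote character that opened the current literal
    std::string out;

    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (state) {
        case State::Code:
            if (c == '/' && next == '*') {
                state = State::BlockComment; ++i; out += ' ';
            } else if (c == '/' && next == '/') {
                state = State::LineComment; ++i; out += ' ';
            } else {
                if (c == '"' || c == '\'') { state = State::Quoted; quote = c; }
                out += c;
            }
            break;
        case State::LineComment:
            if (c == '\n') { state = State::Code; out += c; }  // keep the newline
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { state = State::Code; ++i; }
            break;
        case State::Quoted:
            out += c;
            if (c == '\\' && next != '\0') { out += next; ++i; }  // escaped char
            else if (c == quote) { state = State::Code; }
            break;
        }
    }
    return out;
}
```

Once the comments are gone, the regex from the question can be run on the returned string without picking up commented-out blocks.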

Regex can be used to process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin/end quotes, and comments that can contain symbols.
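
To make the matching-parentheses point concrete, the sketch below finds the closing brace that pairs with a given opening brace. The function name `find_matching_brace` is hypothetical. The depth counter is exactly the kind of unbounded memory that a pattern without recursion cannot express; the sketch ignores quotes and comments for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: given the index of a '{' in `text`, returns the index
// of its matching '}', or std::string::npos if no matching brace exists.
std::size_t find_matching_brace(const std::string& text, std::size_t open) {
    if (open >= text.size() || text[open] != '{') return std::string::npos;
    std::size_t depth = 0;  // number of braces currently open
    for (std::size_t i = open; i < text.size(); ++i) {
        if (text[i] == '{') {
            ++depth;
        } else if (text[i] == '}' && --depth == 0) {
            return i;  // this brace closes the one at `open`
        }
    }
    return std::string::npos;  // ran out of input with braces still open
}
```

For the input `block{ block{ } }`, calling it on the outer brace returns the position of the last `}`, while calling it on the inner brace returns the position of the first `}`.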

If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context-free grammars. If you are really curious, enroll in a Formal Language Theory class.

Since you requested this long regex, here it is.

This will not handle nested blocks like block{ block{ } };
it would match only block{ block{ }, up to the first closing brace.

Since you specified that you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion if you were to use
PCRE or Perl, or even Boost.Regex. Let me know if you'd want to see that.

As it is, it's flawed, but it works for your sample.
Another thing it won't do is parse preprocessor directives ('#...'), because
I forgot the rules for that (though I did it recently, I can't find a record).

To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].matched) etc. That will be your block.
The rest of the matches are for comments, quotes, or non-comments unrelated
to the block. These have to be matched to advance the match position.
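
The loop described above might look roughly like the following sketch. To keep it short it uses a simplified stand-in pattern, which is an assumption and not the full expression from this answer: the first three alternatives consume `/* ... */`, `//` and `#` lines, and only the last alternative sets capture groups 1 (name) and 2 (body). The function name `block_names` is hypothetical. This is the usual way to approximate `(*SKIP)(*FAIL)` in engines that lack it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the names of all blocks found outside
// comments, using a simplified pattern rather than the long regex.
std::vector<std::string> block_names(const std::string& text) {
    static const std::regex re(
        R"~(/\*[\s\S]*?\*/|//[^\n]*|#[^\n]*|(\w+)\s*\{([^}]*)\})~",
        std::regex::optimize);

    std::vector<std::string> names;
    auto pos = text.cbegin();
    std::smatch m;
    while (std::regex_search(pos, text.cend(), m, re)) {
        if (m[1].matched)          // group 1 participated: a real block
            names.push_back(m[1].str());
        pos = m[0].second;         // comment or block, resume after the match
    }
    return names;
}
```

Matches where group 1 did not participate are comments; they are discarded, but they still move the search position past any block-like text they contain. Unlike the long regex, this stand-in does not protect braces inside quotes or comments within a block body.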

The code is long and redundant because there are no function calls (recursion) in the C++11 ECMAScript grammar. Like I said, use boost::regex or something.

Benchmark

Sample:

/*
  Block1 {

    anythinghere
  }
*/

// Block2 { }

Block4 {

   // CommentedBlock{ asdfasdf }
    anyth"}"ing here
}

Block5 {

   /* CommentedBlock{ asdfasdf }
    anyth}"ing here
   */
}

Results:

Regex1:   (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   8
Elapsed Time:    1.95 s,   1947.26 ms,   1947261 µs

Regex Explained:

    # Raw:        (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
    # Stringed:  "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"     


    (?:                              # Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )
 |                                 # OR,

    (?:                              # Non - comments 
         "
         [^"\\]*                          # Double quoted text
         (?: \\ [\S\s] [^"\\]* )*
         "
      |  '
         [^'\\]*                          # Single quoted text
         (?: \\ [\S\s] [^'\\]* )*
         ' 
      |  
         (                                # (1 start), BLOCK
              \w+ \s* \{               
              ####################
              (?:                              # ------------------------
                   (?:                              # Comments  inside a block
                        /\*                             
                        [^*]* \*+
                        (?: [^/*] [^*]* \*+ )*
                        /                                
                     |  
                        //                               
                        (?: [^\\] | \\ \n? )*?
                        \n                               
                   )
                |  
                   (?:                              # Non - comments inside a block
                        "
                        [^"\\]*                          
                        (?: \\ [\S\s] [^"\\]* )*
                        "
                     |  '
                        [^'\\]*                          
                        (?: \\ [\S\s] [^'\\]* )*
                        ' 
                     |  
                        (?! \} )
                        [\S\s]                          
                        [^}/"'\\]*                      
                   )
              )*                               # ------------------------
              #####################          
              \}                               
         )                                # (1 end), BLOCK

      |                                 # OR,

         [\S\s]                           # Any other char
         (?:                              # -------------------------
              (?!                              # ASSERT: Here, cannot be a BLOCK{ }
                   \w+ \s* \{                      
                   (?:                              # ==============================
                        (?:                              # Comments inside a block
                             /\*                              
                             [^*]* \*+
                             (?: [^/*] [^*]* \*+ )*
                             /                                
                          |  
                             //                               
                             (?: [^\\] | \\ \n? )*?
                             \n                               
                        )
                     |  
                        (?:                              # Non - comments inside a block
                             "
                             [^"\\]*                          
                             (?: \\ [\S\s] [^"\\]* )*
                             "
                          |  
                             '
                             [^'\\]*                          
                             (?: \\ [\S\s] [^'\\]* )*
                             ' 
                          |  
                             (?! \} )
                             [\S\s]                          
                             [^}/"'\\]*                       
                        )
                   )*                               # ==============================
                   \}                               
              )                                # ASSERT End

              [^/"'\\]                         # Char which doesn't start a comment, string, escape,
                                               # or line continuation (escape + newline)
         )*                               # -------------------------
    )                                # Done Non - comments 
