
Parsing a regex-like string

I'm trying to parse a regex-like string with the following format:

  • The only characters that need to be escaped are: [ , \ , and -
  • A valid string can be a sequence of:
    • "regular characters", e.g. a , b
    • "escaped special characters", e.g. \\ , \[
    • sequences containing the above two, wrapped in a pair of brackets, e.g. [abc] , [a\]]

For example, abc[def]g and abc\-\[[def\]]gh\\ are both valid strings.

Is there some way to get the character or character class (the third case above) at each index? Pure regex / sed or some Python library would work for me.

Usually, you can't parse it character by character; you have to parse it
construct by construct.

Knowing which group matched tells you what the construct is.
When the class construct matches, you have to parse its contents
separately from the main regex.
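The separate pass over a class body could look roughly like the sketch below. It assumes the body has already been captured without its surrounding brackets, and that an unescaped - inside a class is just a literal character (the question does not define ranges). The names CLASS_ITEM and class_members are made up for illustration.

```python
import re

# One item inside a class body: either a backslash escape or a single
# literal character. DOTALL lets "." also cover a newline.
CLASS_ITEM = re.compile(r"\\(.)|(.)", re.DOTALL)


def class_members(contents):
    """Return the characters listed in a class body, with escapes resolved."""
    members = []
    for m in CLASS_ITEM.finditer(contents):
        escaped, plain = m.group(1), m.group(2)
        # Exactly one of the two groups takes part in each match.
        members.append(escaped if escaped is not None else plain)
    return members


if __name__ == "__main__":
    print(class_members("abc"))      # ['a', 'b', 'c']
    print(class_members(r"def\]"))   # ['d', 'e', 'f', ']']
```

Returning a list keeps the order in which the characters were written; a set would work just as well if only membership matters.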

You'd check all of this in a loop (pseudo-code):

while( regex find )
{
    if group 1 matched        // an escaped character
    else
    if group 2 matched        // a non-class-start or non-escaped char
        // Check if it should be escaped, or is a metachar
    else
    if group 3 matched        // class contents
        // parse the class contents here
    else
    if group 4 matched        // error
}
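In Python, that loop might be sketched as follows, using the example pattern from this answer written as a raw string. The names TOKEN and tokenize are hypothetical, and treating an unescaped \ or - as an error is an assumption based on the question's escaping rule, not something the answer spells out.

```python
import re

# Example pattern from this answer; groups 1-4 correspond to the four
# branches of the pseudo-code.
TOKEN = re.compile(r"(?s)(?:\\(.)|([^\[])|\[((?:\\.|[^\]])*)\]|(.))")


def tokenize(text):
    """Yield (start_index, kind, value) for each construct in text."""
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:        # group 1: escaped character
            yield m.start(), "escaped", m.group(1)
        elif m.group(2) is not None:      # group 2: plain character
            ch = m.group(2)
            # Assumption: a bare backslash or hyphen should have been escaped.
            if ch in "\\-":
                raise ValueError(f"unescaped {ch!r} at index {m.start()}")
            yield m.start(), "char", ch
        elif m.group(3) is not None:      # group 3: raw class contents
            yield m.start(), "class", m.group(3)
        else:                             # group 4: probably an unbalanced '['
            raise ValueError(f"unexpected {m.group(4)!r} at index {m.start()}")


if __name__ == "__main__":
    for token in tokenize("abc[def]g"):
        print(token)
    # (0, 'char', 'a'), (1, 'char', 'b'), (2, 'char', 'c'),
    # (3, 'class', 'def'), (8, 'char', 'g')
```

The class token carries the contents exactly as written, escapes included, so they can be parsed in a second step.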

For example:

(?s)(?:\\(.)|([^\[])|\[((?:\\.|[^\]])*)\]|(.))

Expanded:

 (?s)                # Dot all modifier
 (?:
      \\                  # Escape anything
      ( . )               # (1)
   |                    # or,
      ( [^\[] )           # (2), Anything that does not start a char class
   |                    # or,
      \[                  # Start of char class
      (                   # (3 start)
           (?:                 # ----------
                \\ .                # Escape anything
             |                    # or,
                [^\]]               # Anything that does not end a char class
           )*                  # ----------
      )                   # (3 end)
      \]                  # End of char class
   |                    # or,
      ( . )               # (4), Error, probably an unbalanced '['     
 )
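If Python is the target, a commented layout like the one above can be compiled directly with re.VERBOSE, which ignores whitespace and # comments in the pattern. The sketch below restates the pattern in that style and checks that it tokenizes a few samples the same way as the one-line form. COMPACT, VERBOSE and spans are names chosen for this example.

```python
import re

# One-line form of the pattern.
COMPACT = re.compile(r"(?s)(?:\\(.)|([^\[])|\[((?:\\.|[^\]])*)\]|(.))")

# Same pattern in verbose form; flags are passed as arguments instead of
# an inline (?s), so the pattern may start with whitespace.
VERBOSE = re.compile(
    r"""
    (?:
        \\ (.)                          # (1) escaped character
      | ( [^\[] )                       # (2) anything that does not start a class
      | \[ ( (?: \\. | [^\]] )* ) \]    # (3) class contents
      | (.)                             # (4) error, probably an unbalanced bracket
    )
    """,
    re.VERBOSE | re.DOTALL,
)


def spans(pattern, text):
    """List (start, group_number, group_text) for every match in text."""
    return [
        (m.start(), m.lastindex, m.group(m.lastindex))
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for sample in ("abc[def]g", r"abc\-\[[def\]]gh\\", "ab[cd"):
        assert spans(COMPACT, sample) == spans(VERBOSE, sample)
    print(spans(VERBOSE, "ab[cd"))
    # [(0, 2, 'a'), (1, 2, 'b'), (2, 4, '['), (3, 2, 'c'), (4, 2, 'd')]
```

Since none of the capturing groups are nested, m.lastindex is enough to tell which of the four alternatives matched.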
