Regex.Split() using ( and ) as delimiter except when surrounded by single quotes

I have an input string like:

'lambda' '(' VARIABLE (',' VARIABLE)* ')' EXPRESSION (EXPRESSION)+

and need to split it into tokens separated by spaces, ( and ), and [ and ], except when a ( or ) is immediately surrounded by single quotes.

I would like to create a regular expression to use with C#'s Regex.Split() method that will split the string into the following tokens:

['lambda', '(', VARIABLE, (, ',', VARIABLE, ), *, ')', EXPRESSION, (, EXPRESSION, ), +]

I was previously using the following regex:

(?=[ \\(\\)\\|\\[\\]])|(?<=[ \\(\\)\\|\\[\\]])

which worked great except when ( or ) is surrounded by single quotes, in which case

'('

gets separated into

[', (, ']
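A minimal reproduction of that behaviour, written as C# top-level statements. It assumes the doubled backslashes in the pattern above are C# string escaping, so the same pattern is written here as a verbatim string; the variable names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Zero-width split points before and after every delimiter character.
// Assumption: this verbatim string is equivalent to the escaped pattern above.
string lookaroundPattern = @"(?=[ \(\)\|\[\]])|(?<=[ \(\)\|\[\]])";

// The lookarounds inspect only the delimiter itself, so the quotes
// around the parenthesis play no part in the decision to split.
string[] parts = Regex.Split("'('", lookaroundPattern);

Console.WriteLine(string.Join(" | ", parts));
// prints: ' | ( | '
```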

Help is greatly appreciated.

EDIT

Well, I now have one less problem. Here is my eventual solution, which doesn't use regex at all:

    private void Scan()
    {
        List<char> accum = new List<char>();

        int index = 0;

        List<string> tokens = new List<string>();

        if (INPUT.Length == 0)
            return;

        while (true)
        {
            if ((index == INPUT.Length) || 
                (
                    (
                     (index == 0 || INPUT[index - 1].ToString() != "'") || 
                     (index == INPUT.Length - 1 || INPUT[index + 1].ToString() != "'") || 
                     (INPUT[index] == ' ')
                    ) 
                    &&
                    (
                     INPUT[index] == ' ' || 
                     INPUT[index] == '(' || 
                     INPUT[index] == ')' || 
                     INPUT[index] == '[' || 
                     INPUT[index] == ']' || 
                     INPUT[index] == '|'
                    )
                )
            )
            {
                string accumulatedToken = string.Join("", accum);
                string currentToken = index < INPUT.Length ? INPUT[index].ToString() : "";
                tokens.Add(accumulatedToken);
                tokens.Add(currentToken);

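                CURRENT_TOKEN = tokens.FirstOrDefault(t => !string.IsNullOrWhiteSpace(t));

                if (CURRENT_TOKEN != null)
                {
                    // Consume the token, plus any whitespace that follows it.
                    INPUT = INPUT.Substring(CURRENT_TOKEN.Length).TrimStart();
                    break;
                }

                // Only whitespace was found: drop it and start over, so that
                // CURRENT_TOKEN is never dereferenced while it is null.
                INPUT = INPUT.TrimStart();
                if (INPUT.Length == 0)
                    return;

                accum.Clear();
                tokens.Clear();
                index = 0;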
            }
            else
            {
                accum.Add(INPUT[index]);
                index++;
            }
        }
    }

The regex to pull this off becomes simpler once you know that Regex.Split() can split on a delimiter and still retain it: place the delimiter(s) within a capturing group, and the captured text is included in the result.
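As a small illustration of that behaviour, written as C# top-level statements (the sample string and variable names are made up for the example): splitting on a plain pattern discards the delimiter, while wrapping it in a capturing group keeps it.

```csharp
using System;
using System.Text.RegularExpressions;

// Without a group, the "+" delimiters are dropped from the result.
string[] dropped = Regex.Split("a+b+c", @"\+");
Console.WriteLine(string.Join(" | ", dropped));   // a | b | c

// With a capturing group, each captured delimiter is added to the result.
string[] kept = Regex.Split("a+b+c", @"(\+)");
Console.WriteLine(string.Join(" | ", kept));      // a | + | b | + | c
```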

The following pattern yields the output you mentioned:

var input = "'lambda' '(' VARIABLE (',' VARIABLE)* ')' EXPRESSION (EXPRESSION)+";
var pattern = @"\s*('[()]'|[()])\s*|[\s[\]]";
var result = Regex.Split(input, pattern);
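// Join the tokens for display; passing the array itself would print "System.String[]".
Console.WriteLine(string.Join(", ", result));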

Pattern explanation: \s*('[()]'|[()])\s*|[\s[\]]

  • \s*('[()]'|[()])\s* :
    • \s* : consumes any whitespace before and after the delimiter (it is placed at both ends), so the neighbouring tokens come out trimmed
    • ('[()]'|[()]) : this entire portion is placed within a capturing group (...) since we want to split on the delimiters within it and also include them in the result. It matches a parenthesis enclosed in single quotes, '[()]', or a parenthesis that isn't enclosed in single quotes, [()]. A match starting at the opening quote is found before the scan reaches the parenthesis itself, so '(' and ')' come out as single tokens.
  • | : alternation, matching either the first portion or the next one
  • [\s[\]] : split on whitespace, [ or ]
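Putting the pieces together, here is a sketch that wraps the pattern in a helper, again as C# top-level statements. The Tokenize name and the empty-string filter are additions for illustration, not part of the answer above. The filter is there because Regex.Split() can return empty strings when a match sits at the very start or end of the input, or when two matches are adjacent, as in ((.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: split with the pattern explained above
// and drop any empty entries that Regex.Split may produce.
static string[] Tokenize(string input)
{
    const string pattern = @"\s*('[()]'|[()])\s*|[\s[\]]";
    return Regex.Split(input, pattern)
                .Where(token => token.Length > 0)
                .ToArray();
}

string[] tokens = Tokenize("'lambda' '(' VARIABLE (',' VARIABLE)* ')' EXPRESSION (EXPRESSION)+");

// Print one token per line.
foreach (string token in tokens)
{
    Console.WriteLine(token);
}
```

For the sample input this gives 14 tokens: 'lambda', '(', VARIABLE, (, ',', VARIABLE, ), *, ')', EXPRESSION, (, EXPRESSION, ), +.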
