Regex.Split() using ( and ) as delimiter except when surrounded by single quotes
I have an input string like:
'lambda' '(' VARIABLE (',' VARIABLE)* ')' EXPRESSION (EXPRESSION)+
and need to split it into tokens separated by spaces, ( and ) and [ and ], except when a ( or ) is immediately surrounded by single quotes.
I would like to create a regex to use with C#'s Regex.Split() method that will split the string into the following tokens:
['lambda', '(', VARIABLE, (, ',', VARIABLE, ), *, ')', EXPRESSION, (, EXPRESSION, ), +]
I was previously using the following regex:
(?=[ \\(\\)\\|\\[\\]])|(?<=[ \\(\\)\\|\\[\\]])
which worked great except when a ( or ) is surrounded by single quotes, in which case
'('
gets separated into
[', (, ']
Help is greatly appreciated.
EDIT
Well, I now have one less problem. Here is my eventual solution, which does not use regex at all:
// Assumes INPUT and CURRENT_TOKEN are string fields of the enclosing class,
// and that System.Linq and System.Collections.Generic are imported.
private void Scan()
{
    List<char> accum = new List<char>();
    int index = 0;
    List<string> tokens = new List<string>();
    if (INPUT.Length == 0)
        return;
    while (true)
    {
        if ((index == INPUT.Length) ||
            (
                (
                    // Not a delimiter when quoted on both sides, e.g. '('
                    (index == 0 || INPUT[index - 1] != '\'') ||
                    (index == INPUT.Length - 1 || INPUT[index + 1] != '\'') ||
                    (INPUT[index] == ' ')
                )
                &&
                (
                    INPUT[index] == ' ' ||
                    INPUT[index] == '(' ||
                    INPUT[index] == ')' ||
                    INPUT[index] == '[' ||
                    INPUT[index] == ']' ||
                    INPUT[index] == '|'
                )
            )
        )
        {
            string accumulatedToken = string.Join("", accum);
            string currentToken = index < INPUT.Length ? INPUT[index].ToString() : "";
            tokens.Add(accumulatedToken);
            tokens.Add(currentToken);
            CURRENT_TOKEN = tokens.FirstOrDefault(t => !string.IsNullOrWhiteSpace(t));
            if (CURRENT_TOKEN != null)
            {
                // Consume the token and any whitespace that follows it.
                INPUT = INPUT.Substring(CURRENT_TOKEN.Length).TrimStart();
                break;
            }
            // Only whitespace was scanned: drop it and start over.
            INPUT = INPUT.TrimStart();
            if (INPUT.Length == 0)
                return;
            accum.Clear();
            tokens.Clear();
            index = 0;
        }
        else
        {
            accum.Add(INPUT[index]);
            index++;
        }
    }
}
The regex to pull this off becomes simpler once you know that it's possible to split and retain a delimiter by placing the delimiter(s) within a capturing group.
The following pattern yields the output you mentioned:
var input = "'lambda' '(' VARIABLE (',' VARIABLE)* ')' EXPRESSION (EXPRESSION)+";
var pattern = @"\s*('[()]'|[()])\s*|[\s[\]]";
var result = Regex.Split(input, pattern);
Console.WriteLine(string.Join(", ", result));
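To see the effect of the group in isolation, here is a minimal sketch comparing a split with and without a capturing group. The class and method names (SplitGroupDemo, WithoutGroup, WithGroup) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitGroupDemo
{
    // Splits on '-' and discards the delimiter.
    public static string[] WithoutGroup(string s) => Regex.Split(s, "-");

    // Same split, but the capturing group keeps each '-' in the result.
    public static string[] WithGroup(string s) => Regex.Split(s, "(-)");

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", WithoutGroup("a-b-c"))); // a | b | c
        Console.WriteLine(string.Join(" | ", WithGroup("a-b-c")));    // a | - | b | - | c
    }
}
```

This is why the quoted and unquoted parentheses sit inside (...) in the pattern above, while whitespace and brackets, which should be discarded, do not.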
Pattern explanation: \s*('[()]'|[()])\s*|[\s[\]]

\s*('[()]'|[()])\s* :
    \s* : trims leading/trailing whitespace (placed at both ends)
    ('[()]'|[()]) : this entire portion is placed within a group (...) since we want to split on the delimiters within and include them in the result. We want to match parentheses within single quotes, '[()]', and parentheses that aren't enclosed within single quotes, [()].
| : alternation to match the first portion or the next one
[\s[\]] : split on whitespace, [ or ]
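One caveat: Regex.Split returns empty strings when a match sits at the very start or end of the input, or when two delimiters are directly adjacent. The sample input happens not to trigger this, but an input such as (a)(b) would. Below is a minimal sketch that wraps the pattern and drops those empty entries, assuming they are unwanted. The TokenizerSketch class and Tokenize method are hypothetical names, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // The pattern from the answer: quoted or bare parentheses are captured
    // (and therefore kept); whitespace and square brackets are discarded.
    private const string Pattern = @"\s*('[()]'|[()])\s*|[\s[\]]";

    // Splits the input and removes the empty strings that Regex.Split
    // produces around leading, trailing, or adjacent delimiters.
    public static string[] Tokenize(string input) =>
        Regex.Split(input, Pattern)
             .Where(t => t.Length > 0)
             .ToArray();

    public static void Main()
    {
        var input = "'lambda' '(' VARIABLE (',' VARIABLE)* ')' EXPRESSION (EXPRESSION)+";
        Console.WriteLine(string.Join(", ", Tokenize(input)));
    }
}
```

Filtering after the split keeps the pattern itself unchanged, so the explanation above still applies as written.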