How do I specify the precedence of matching patterns in a regular expression?

Question

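I'm writing a function-parsing engine that uses regular expressions to separate the individual terms (each defined as a constant or a variable, optionally followed by an operator). It works well, except when I have grouped terms inside other grouped terms. Here is the code I'm using: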

//This matches an opening delimiter
Regex openers = new Regex("[\\[\\{\\(]");

//This matches a closing delimiter
Regex closers = new Regex("[\\]\\}\\)]");

//This matches the name of a variable (\w+) or a constant numeric value (\d+(\.\d+)?)
Regex VariableOrConstant = new Regex("((\\d+(\\.\\d+)?)|\\w+)" + FunctionTerm.opRegex + "?");

//This matches the binary operators +, *, -, or /
Regex ops = new Regex("[\\*\\+\\-/]");

//This compound Regex finds a single variable or constant term (including a following operator,
//if any) OR a group containing multiple terms (and their following operators, if any)
//and a following operator, if any.
//Matches that match this second pattern need to be added to the function as sub-functions,
//not as individual terms, to ensure the correct evaluation order with parentheses.
Regex splitter = new Regex(
openers + 
"(" + VariableOrConstant + ")+" + closers + ops + "?" +
"|" +
"(" + VariableOrConstant + ")" + ops + "?");

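When "splitter" is matched against the string "4/(2*X*[2+1])", the values of the matches are "4/", "2*", "X*", "2+", and "1", completely ignoring all of the delimiting parentheses and brackets. I believe this is because the second half of the "splitter" regex (the part after the "|") is matching and overriding the other part of the pattern. This is bad: I want grouped expressions to take precedence over single terms. Does anyone know how I can do this? I looked into positive/negative lookaheads and lookbehinds, but I'm honestly not sure how to use them, or even what they are for that matter, and I can't find any relevant examples... Thanks in advance.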

Answer 1

You didn't tell us how you're applying the regex, so here is a demo I whipped up:

private static void ParseIt(string subject)
{
  Console.WriteLine("subject : {0}\n", subject);

  Regex openers = new Regex(@"[[{(]");
  Regex closers = new Regex(@"[]})]");
  Regex ops = new Regex(@"[*+/-]");
  Regex VariableOrConstant = new Regex(@"((\d+(\.\d+)?)|\w+)" + ops + "?");

  Regex splitter = new Regex(
    openers + @"(?<FIRST>" + VariableOrConstant + @")+" + closers + ops + @"?" +
    @"|" +
    @"(?<SECOND>" + VariableOrConstant + @")" + ops + @"?",
    RegexOptions.ExplicitCapture
  );

  foreach (Match m in splitter.Matches(subject))
  {
    foreach (string s in splitter.GetGroupNames())
    {
      Console.WriteLine("group {0,-8}: {1}", s, m.Groups[s]);
    }
    Console.WriteLine();
  }
}

Output:

subject : 4/(2*X*[2+1])

group 0       : 4/
group FIRST   :
group SECOND  : 4/

group 0       : 2*
group FIRST   :
group SECOND  : 2*

group 0       : X*
group FIRST   :
group SECOND  : X*

group 0       : [2+1]
group FIRST   : 1
group SECOND  :

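As you can see, the term [2+1] is matched by the first part of the regex, as you intended. It can't do anything with the ( , though, because the next bracketing character after it is another "opener" ( [ ), and the regex is looking for a "closer".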
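To check that the first alternative really does win whenever it can match, here is a small sketch that runs the same splitter pattern (written out as one string) on a subject whose group is not nested, and then on the original subject. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Same pattern as the demo above, with the sub-patterns spliced in by hand.
    private static readonly Regex Splitter = new Regex(
        @"[[{(](?<FIRST>((\d+(\.\d+)?)|\w+)[*+/-]?)+[]})][*+/-]?" +
        @"|" +
        @"(?<SECOND>((\d+(\.\d+)?)|\w+)[*+/-]?)[*+/-]?",
        RegexOptions.ExplicitCapture);

    // Returns the text of every match, in order.
    public static List<string> Split(string subject)
    {
        var parts = new List<string>();
        foreach (Match m in Splitter.Matches(subject))
            parts.Add(m.Value);
        return parts;
    }

    public static void Main()
    {
        // Non-nested group: the first alternative matches the whole "(2*X)".
        Console.WriteLine(string.Join(" | ", Split("4/(2*X)")));       // 4/ | (2*X)

        // Nested group: the first alternative fails at "(" because it meets "["
        // before a closer, so the engine moves on and matches the pieces instead.
        Console.WriteLine(string.Join(" | ", Split("4/(2*X*[2+1])"))); // 4/ | 2* | X* | [2+1]
    }
}
```

So the second alternative is not overriding the first. At each starting position the alternatives are tried left to right, and the second one is only used where the first one has already failed.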

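You could use .NET's "balanced matching" feature to allow for grouped terms enclosed in other groups, but it's not worth the effort. Regexes are not designed for parsing. In fact, parsing and regex matching are fundamentally different kinds of operation, and this is a good example of the difference: a regex actively seeks out matches, skipping over anything it can't use (like the open parenthesis in your example), whereas a parser has to examine every character (even if only to decide to ignore it).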
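For comparison, here is a minimal sketch of what a hand-written parser for this kind of input might look like. It is a plain recursive-descent evaluator, not anything from your code: the class name ExprParser, the Evaluate signature, and the idea of passing variable values in a dictionary are all assumptions made for the example. Every character is consumed exactly once, and nesting is handled by recursion rather than by pattern matching.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical recursive-descent evaluator (illustration only).
public sealed class ExprParser
{
    private readonly string _text;
    private readonly IDictionary<string, double> _vars;
    private int _pos;

    private ExprParser(string text, IDictionary<string, double> vars)
    {
        _text = text;
        _vars = vars;
    }

    public static double Evaluate(string text, IDictionary<string, double> vars)
    {
        var p = new ExprParser(text, vars);
        double value = p.ParseSum();
        p.SkipSpaces();
        if (p._pos < text.Length)   // leftover input is an error, never skipped
            throw new FormatException("Unexpected character at position " + p._pos);
        return value;
    }

    // sum := product (('+' | '-') product)*
    private double ParseSum()
    {
        double left = ParseProduct();
        while (true)
        {
            char op = Peek();
            if (op != '+' && op != '-') return left;
            _pos++;
            double right = ParseProduct();
            left = op == '+' ? left + right : left - right;
        }
    }

    // product := atom (('*' | '/') atom)*
    private double ParseProduct()
    {
        double left = ParseAtom();
        while (true)
        {
            char op = Peek();
            if (op != '*' && op != '/') return left;
            _pos++;
            double right = ParseAtom();
            left = op == '*' ? left * right : left / right;
        }
    }

    // atom := number | variable | opener sum matching-closer
    private double ParseAtom()
    {
        char c = Peek();
        int open = "([{".IndexOf(c);
        if (open >= 0)
        {
            _pos++;
            double inner = ParseSum();   // recursion handles any nesting depth
            if (Peek() != ")]}"[open])
                throw new FormatException("Expected '" + ")]}"[open] + "' at position " + _pos);
            _pos++;
            return inner;
        }

        int start = _pos;
        while (_pos < _text.Length && (char.IsLetterOrDigit(_text[_pos]) || _text[_pos] == '.'))
            _pos++;
        if (start == _pos)
            throw new FormatException("Expected a term at position " + _pos);

        string token = _text.Substring(start, _pos - start);
        if (char.IsDigit(token[0]))
            return double.Parse(token, CultureInfo.InvariantCulture);
        return _vars[token];             // throws KeyNotFoundException for unknown names
    }

    private void SkipSpaces()
    {
        while (_pos < _text.Length && _text[_pos] == ' ') _pos++;
    }

    // Returns the next non-space character, or '\0' at end of input.
    private char Peek()
    {
        SkipSpaces();
        return _pos < _text.Length ? _text[_pos] : '\0';
    }

    public static void Main()
    {
        var vars = new Dictionary<string, double> { { "X", 2.0 } };
        // Evaluates 4 / (2 * 2 * (2 + 1)).
        Console.WriteLine(Evaluate("4/(2*X*[2+1])", vars));
    }
}
```

Because the closer is checked against the opener that started the group, an input like "(1+2]" is rejected with a FormatException instead of being silently split into pieces.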

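About the demo: I tried to make the minimum functional changes necessary to get your code working (which is why I didn't correct the error of putting the + outside the capturing group), but I also made several cosmetic changes, and those represent active recommendations. To wit:

Always use verbatim string literals ( @"..." ) when creating regexes in C# (I think the reason is obvious).
If you're using capturing groups, use named groups whenever possible, but don't use named groups and numbered groups in the same regex. Named groups save you the hassle of keeping track of what is captured where, and the ExplicitCapture option saves you from cluttering up the regex with (?:...) wherever you need a non-capturing group.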
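A small sketch of both recommendations together. The pattern and the names NamedGroupDemo, Term, and Describe are invented for the example; only the Regex calls themselves are standard .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Verbatim literal: each backslash is written once, exactly as the regex engine sees it.
    // With ExplicitCapture, the plain (...) around the fraction only groups; it captures nothing.
    private static readonly Regex Term = new Regex(
        @"(?<NUMBER>\d+(\.\d+)?)(?<OP>[*+/-])?",
        RegexOptions.ExplicitCapture);

    // Looks captures up by name, so no counting of parentheses is needed.
    public static string Describe(string input)
    {
        Match m = Term.Match(input);
        return m.Groups["NUMBER"].Value + " " + m.Groups["OP"].Value;
    }

    public static int GroupCount()
    {
        return Term.GetGroupNames().Length;
    }

    public static void Main()
    {
        Console.WriteLine("\\d+" == @"\d+");   // True: same pattern, less noise
        Console.WriteLine(Describe("3.14*"));  // 3.14 *
        Console.WriteLine(GroupCount());       // 3 (groups "0", "NUMBER" and "OP")
    }
}
```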

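Finally, the whole scheme of building a large regex out of a bunch of smaller regexes is of very limited use, IMO. It is very difficult to keep track of the interactions between the parts, such as which part is inside which group. Another advantage of C#'s verbatim strings is that they are multiline, so you can take advantage of free-spacing mode (a.k.a. /x or COMMENTS mode):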

  Regex r = new Regex(@"
    (?<GROUPED>
      [[{(]                  # opening bracket
      (                      # group containing:
        ((\d+(\.\d+)?)|\w+)     # number or variable
        [*+/-]?                 # and proceeding operator
      )+                     # ...one or more times
      []})]                  # closing bracket
      [*+/-]?                # and proceeding operator
    )
    |
    (?<UNGROUPED>
      ((\d+(\.\d+)?)|\w+)    # number or variable
      [*+/-]?                # and proceeding operator
    )
    ",
    RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace
  );

This is not meant as a solution to your problem; as I said, that is not a job for regexes. It is just a demonstration of some useful regex techniques.

Answer 2

Try using different quantifiers.

Greedy:

*  +  ?

Possessive:

*+ ++ ?+

Lazy:

*? +? ??
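A quick sketch of the difference, with made-up inputs. One assumption to flag: as far as I know, .NET's Regex class does not accept the possessive syntax ( *+ and friends ), so the example uses an atomic group (?>...) instead, which refuses to give back characters in the same way.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match of pattern in input ("" if none).
    public static string First(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "[1+2]*[3+4]";

        // Greedy: .+ grabs as much as possible, then backs off to the LAST "]".
        Console.WriteLine(First(@"\[.+\]", input));   // [1+2]*[3+4]

        // Lazy: .+? grabs as little as possible, so it stops at the FIRST "]".
        Console.WriteLine(First(@"\[.+?\]", input));  // [1+2]

        // Atomic group: \d+ takes all three digits and will not give one back,
        // so the trailing \d has nothing left to match.
        Console.WriteLine(Regex.IsMatch("123", @"\d+\d"));      // True
        Console.WriteLine(Regex.IsMatch("123", @"(?>\d+)\d"));  // False
    }
}
```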

Take a look at this and this.

Maybe a non-capturing group as well:

(?:your expr here)

Give it a try! Practice makes perfect! :)

Answer 1
5, accepted, 2010-12-14 03:47:34

Answer 2
3, 2010-12-13 15:19:28
