Strange behavior of negative lookahead assertions in Java regular expressions

Question

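I'm trying to understand the behavior of regular expressions in Java and have run into something that looks very odd. In the code below, the test suddenly fails, for reasons I don't understand, at the assertion labelled "6 letters matched, negative" (the two assertions after it fail as well). Have I been staring at this for too long, or is something strange really going on? I'm not sure whether this has anything to do with the variable-length negative lookahead assertion (?!X), but I'd be glad to hear any theories, or even just confirmation that others see the same problem and that it isn't specific to my JVM. Sorry the regex is so contrived, but you wouldn't want to see the real thing :)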

// $ java -version
// java version "1.7.0_10"
// Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
// Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

// test of word without agreement
String test = "plusieurs personne sont";

// match the pattern with curly braces
assertTrue("no letters matched", Pattern.compile("plusieurs personne\\b").matcher(test).find());
assertTrue("1 letters matched", Pattern.compile("plusieurs personn\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("2 letters matched", Pattern.compile("plusieurs person\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("3 letters matched", Pattern.compile("plusieurs perso\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("4 letters matched", Pattern.compile("plusieurs pers\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("5 letters matched", Pattern.compile("plusieurs per\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("6 letters matched", Pattern.compile("plusieurs pe\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("7 letters matched", Pattern.compile("plusieurs p\\p{Alpha}{1,100}\\b").matcher(test).find());
assertTrue("8 letters matched", Pattern.compile("plusieurs \\p{Alpha}{1,100}\\b").matcher(test).find());

// match the negative pattern (without s or x) with curly braces
assertTrue("no letters matched, negative", Pattern.compile("plusieurs (?!personne[sx])\\w+").matcher(test).find());
assertTrue("1 letters matched, negative", Pattern.compile("plusieurs (?!personn\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
assertTrue("2 letters matched, negative", Pattern.compile("plusieurs (?!person\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
assertTrue("3 letters matched, negative", Pattern.compile("plusieurs (?!perso\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
assertTrue("4 letters matched, negative", Pattern.compile("plusieurs (?!pers\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
assertTrue("5 letters matched, negative", Pattern.compile("plusieurs (?!per\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
// the assertion below fails (is false) for reasons unknown
assertTrue("6 letters matched, negative", Pattern.compile("plusieurs (?!pe\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
assertTrue("7 letters matched, negative", Pattern.compile("plusieurs (?!p\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());
assertTrue("8 letters matched, negative", Pattern.compile("plusieurs (?!\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find());

Answer 1

Let's look at how the lookahead matches:

pe           literal, matches "pe"
r            matches \p{Alpha}{1,100}
s            matches [sx]

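So the expression inside the negative lookahead matches, and the negative lookahead itself therefore fails (the tail of your string, "onne sont", is irrelevant here).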
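
To check this by hand, the body of the lookahead can be run as an ordinary pattern anchored at the text that follows "plusieurs ". This is only a sketch: the class name LookaheadBodyDemo and the helper bodyMatch are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadBodyDemo {
    // Applies a lookahead body as an ordinary pattern anchored at the start of
    // `rest` and returns the text it consumes, or null if it cannot match there.
    static String bodyMatch(String body, String rest) {
        Matcher m = Pattern.compile(body).matcher(rest);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The text that follows "plusieurs " in the test string.
        String rest = "personne sont";

        // Case 6: "pe" + "r" + "s" is enough for the body to match.
        System.out.println(bodyMatch("pe\\p{Alpha}{1,100}[sx]", rest));  // prints pers

        // Case 5: after "per" no letter run is followed by s or x.
        System.out.println(bodyMatch("per\\p{Alpha}{1,100}[sx]", rest)); // prints null
    }
}
```

Whenever the body matches at that position, the negative lookahead fails, and with it the whole pattern.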

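If your idea is that the next word should not end in s or x, putting \\b after [sx] may help. Always keep in mind that a negative lookahead has no regrets about failing: the matcher backtracks to find a way for the inner expression to match, never to find a way for your regex not to match.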
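
A minimal sketch of the word-boundary suggestion, applied to case 6. The class name WordBoundaryFix, the constants, and the second input string "plusieurs personnes sont" are my own additions for illustration.

```java
import java.util.regex.Pattern;

class WordBoundaryFix {
    // Case 6 exactly as in the question.
    static final String ORIGINAL = "plusieurs (?!pe\\p{Alpha}{1,100}[sx])\\w+";
    // Same pattern with \b after [sx], so the s or x must end the word.
    static final String WITH_BOUNDARY = "plusieurs (?!pe\\p{Alpha}{1,100}[sx]\\b)\\w+";

    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The "s" in the middle of "personne" satisfies [sx], so this fails.
        System.out.println(found(ORIGINAL, "plusieurs personne sont"));       // false
        // With \b, that mid-word "s" no longer counts.
        System.out.println(found(WITH_BOUNDARY, "plusieurs personne sont"));  // true
        // A word that really ends in s is still rejected.
        System.out.println(found(WITH_BOUNDARY, "plusieurs personnes sont")); // false
    }
}
```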

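UPD: Let's take a closer look at case 5 and compare it with case 6. Here we are dealing with a hypothetical match (of the expression inside the lookahead), so we have to consider several variants of how it could (almost) happen.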

per          literal, would match "per" -- it's always so
             -- let's try to imagine how the rest could match:
sonn         would match \p{Alpha}{1,100}
e            wouldn't match [sx], FAIL
             -- or maybe
s            would match \p{Alpha}{1,100}
o            wouldn't match [sx], FAIL
             -- or maybe yet
so           would match \p{Alpha}{1,100}
n            wouldn't match [sx], FAIL.
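
The attempts listed above can be generated mechanically. The following is a hand simulation, not the real regex engine: it walks through the letter runs that \p{Alpha}{1,100} could take after a literal prefix, longest first, and checks whether the next character is s or x. The class BacktrackTrace and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class BacktrackTrace {
    // \p{Alpha} matches ASCII letters only by default.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Simulates literal + \p{Alpha}{1,100} + [sx] anchored at the start of input.
    // Returns one line per attempted run length, stopping at the first success.
    static List<String> trace(String literal, String input) {
        List<String> steps = new ArrayList<String>();
        if (!input.startsWith(literal)) {
            return steps;
        }
        int start = literal.length();
        int end = start;
        // Greedy phase: take as many letters as possible, up to 100.
        while (end < input.length() && end - start < 100 && isAsciiLetter(input.charAt(end))) {
            end++;
        }
        // Backtracking phase: give letters back one at a time, keeping at least one.
        for (int stop = end; stop > start; stop--) {
            boolean hit = stop < input.length() && "sx".indexOf(input.charAt(stop)) >= 0;
            String next = stop < input.length() ? String.valueOf(input.charAt(stop)) : "end of input";
            steps.add(input.substring(start, stop) + " then '" + next + "': " + (hit ? "MATCH" : "FAIL"));
            if (hit) {
                break;
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        for (String s : trace("per", "personne sont")) System.out.println(s); // case 5
        System.out.println("--");
        for (String s : trace("pe", "personne sont")) System.out.println(s);  // case 6
    }
}
```

For case 5 every attempt fails, so the body cannot match and the negative lookahead succeeds. For case 6 the last attempt (the run "r" followed by "s") succeeds, so the negative lookahead fails.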

We would also have an interesting adventure if the second word were "personalisation".
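
As a sketch of that adventure, assume the second word is "personalisation" and the input is the made-up string "plusieurs personalisation sont". The case 5 pattern, which accepts "personne", now fails, because the letter run can stop right before the "s" in the middle of the word. The class name PersonalisationDemo and the helper bodyMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonalisationDemo {
    // Body of the case 5 lookahead, used as an ordinary pattern.
    static final Pattern CASE_5_BODY = Pattern.compile("per\\p{Alpha}{1,100}[sx]");
    // Case 5 exactly as in the question.
    static final Pattern CASE_5 = Pattern.compile("plusieurs (?!per\\p{Alpha}{1,100}[sx])\\w+");

    // Returns what the lookahead body consumes at the start of `rest`, or null.
    static String bodyMatch(String rest) {
        Matcher m = CASE_5_BODY.matcher(rest);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "per" + "sonali" + "s": the body matches inside the word.
        System.out.println(bodyMatch("personalisation sont"));                       // prints personalis
        // So the negative lookahead fails and nothing is found.
        System.out.println(CASE_5.matcher("plusieurs personalisation sont").find()); // false
        // The original test string still passes case 5.
        System.out.println(CASE_5.matcher("plusieurs personne sont").find());        // true
    }
}
```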

UPD2: The discussion in the comments prompted me to add a generalization here:

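Regular expressions are appealing because they share an important trait of the human mind: confirmation bias. When we write regular expressions, we want them to match; even when our job is to keep invalid input out, we spend most of our time thinking about valid input. A regex matcher usually shares this trait: it wants to match and hates to fail. That is why a subexpression like \\p{Alpha}{1,100} does not mean "eat the longest available chunk of Alpha before trying to match the rest of the input". Roughly, it means "consider every possible chunk of Alpha with length in [1,100], and find a way for the whole expression to match".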
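
The two readings can be compared directly, because Java also has a possessive form of the quantifier, {1,100}+, which really does eat the longest chunk and never gives anything back. A small sketch follows; the class name GreedyVsPossessive, the helper bodyMatches, and the input "personnes sont" are made up for illustration.

```java
import java.util.regex.Pattern;

class GreedyVsPossessive {
    // Ordinary (greedy, backtracking) quantifier, as in the question.
    static final String GREEDY = "pe\\p{Alpha}{1,100}[sx]";
    // Possessive quantifier: takes the longest letter run and keeps it.
    static final String POSSESSIVE = "pe\\p{Alpha}{1,100}+[sx]";

    // True if `body` matches at the very start of `rest`.
    static boolean bodyMatches(String body, String rest) {
        return Pattern.compile(body).matcher(rest).lookingAt();
    }

    public static void main(String[] args) {
        // Greedy: backs off until the run is "r" and [sx] can take the "s".
        System.out.println(bodyMatches(GREEDY, "personne sont"));      // true
        // Possessive: the run takes "rsonne", then [sx] meets the space.
        System.out.println(bodyMatches(POSSESSIVE, "personne sont"));  // false
        // Possessive also swallows a final s, so [sx] can never match.
        System.out.println(bodyMatches(POSSESSIVE, "personnes sont")); // false
    }
}
```

The last line shows why the possessive form is not a fix for the question: the letter run always consumes a trailing s or x, so the body never matches and the negative lookahead never rejects anything.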

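That is why, with regular expressions, it is so easy to overlook the false positives of a complex expression: invalid input that is wrongly accepted. Negative lookahead does not create this problem; it only makes it more visible: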

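Inside a negative lookahead, the regex matcher wants to match the inner expression (which makes the outer expression fail). The human programmer still wants the outer expression to match; as we saw in our example, this really does affect how we reason about the inner expression. We think it shouldn't try so hard to match (for example, that it should treat subexpressions in a dumb way, eating the longest possible input at once). The matcher works as usual, but our idea of the desired behavior is now out of sync with its algorithm. A false positive of the inner expression (hard to notice) becomes a false negative of the outer expression (which we notice and hate).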

Answer 1
3, accepted, 2013-01-16 21:26:22
