PHP regex crashes Apache

Question

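I have a regex that does the matching for a template system. Unfortunately it seems to crash Apache (running on Windows) on some fairly trivial lookups. I've researched the issue and found a few suggestions, such as raising the stack size, but none of them seem to work. I also don't like handling problems like this by raising limits, since that generally just pushes the bug into the future.

Anyway, any ideas on how to alter the regex to make it less likely to foul up?

The idea is to capture the innermost block (in this case {block:test}This should be caught first!{/block:test}). I would then str_replace out the starting/ending tags and re-run the whole thing through the regex until there are no blocks left.

Regex: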

~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)})(?P<contents>(?:(?!{/?block:[0-9a-z-_]+}).)*)(?P<closing>{/block:\3})~ism

Example template:

<div class="f_sponsors s_banners">
    <div class="s_previous">&laquo;</div>
    <div class="s_sponsors">
        <ul>
            {block:sponsors}
            <li>
                <a href="{var:url}" target="_blank">
                    <img src="image/160x126/{var:image}" alt="{var:name}" title="{var:name}" />
                </a>
            {block:test}This should be caught first!{/block:test}
            </li>
            {/block:sponsors}
        </ul>
    </div>
    <div class="s_next">&raquo;</div>
</div>

I suppose this is a long shot. :(

Answer 1

You can use an atomic group, (?>...), or possessive quantifiers, ?+ *+ ++, to suppress or limit backtracking, and speed up matching with the unrolling-the-loop technique. My solution:

\{block:(\w++)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}

I tested it at http://regexr.com?31p03.

Matching {block:sponsors}...{/block:sponsors}:
\{block:(sponsors)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb3

Matching {block:test}...{/block:test}:
\{block:(test)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb6
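
To see what the possessive quantifiers change, here is a small PHP sketch. The sample template string and the variable names are made up for illustration. It first contrasts a greedy quantifier with a possessive one and an atomic group, then runs the pattern above using ~ as the delimiter.

```php
<?php
// A greedy quantifier gives characters back so the rest of the pattern can match...
$greedy = preg_match('~^a*a$~', 'aaa');
// ...a possessive quantifier or an atomic group never gives anything back,
// so the trailing "a" has nothing left to match.
$possessive = preg_match('~^a*+a$~', 'aaa');
$atomic = preg_match('~^(?>a*)a$~', 'aaa');

// Hypothetical sample input, shortened from the template in the question.
$template = '{block:sponsors}' . "\n" . '<li>{var:name}'
          . '{block:test}This should be caught first!{/block:test}</li>' . "\n"
          . '{/block:sponsors}';

$pattern = '~\{block:(\w++)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}~';

if (preg_match($pattern, $template, $m)) {
    // $m[1] is the block name, $m[2] the block contents.
    echo "matched block: {$m[1]}\n";
}
```

Because the lookahead only rejects tags with the same name as the opening tag, the first match here is the outer sponsors block, with the nested test block inside its contents.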

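Another solution:
In the PCRE source code, you can remove the comment in config.h (so that NO_RECURSE ends up defined when PCRE is built):
/* #undef NO_RECURSE */

The following text is copied from config.h:
PCRE uses recursive function calls to handle backtracking while matching. This can sometimes be a problem on systems that have stacks of limited size. Define NO_RECURSE to get a version that doesn't use recursion in the match() function; instead it creates its own stack by steam using pcre_recurse_malloc() to obtain memory from the heap.

Or you can change pcre.backtrack_limit and pcre.recursion_limit in php.ini (http://www.php.net/manual/en/pcre.configuration.php).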
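
As a rough sketch of the php.ini route: the same settings can also be changed at runtime with ini_set(), and preg_last_error() reports when a limit was hit. The limit value below is purely illustrative, and the backtracking-heavy pattern is borrowed from the PHP manual's preg_last_error() example. JIT is switched off here on the assumption that you want to observe the limits of the interpreted matcher.

```php
<?php
// Illustrative values only; tune them for your own templates.
ini_set('pcre.jit', '0');                 // assumption: observe the non-JIT matcher
ini_set('pcre.backtrack_limit', '1000');

// A pattern that backtracks heavily on input it cannot match.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

if ($result === false) {
    // preg_match() returns false on error; the reason is in preg_last_error().
    $hitBacktrackLimit = (preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR);
    echo $hitBacktrackLimit ? "backtrack limit reached\n" : "other PCRE error\n";
}
```

A limit makes the match fail with an error code instead of running on, so the calling code still has to check for false and decide what to do with the template.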

Answer 2

Try this:

'~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)(?P<closing>\{/block:(?P=name)\})~i'

Or, in readable form:

'~(?P<opening>
  \{
  (?P<inverse>[!])?
  block:
  (?P<name>[a-z0-9\s_-]+)
  \}
)
(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)
(?P<closing>
  \{
  /block:(?P=name)
  \}
)~ix'

The most important part is the (?P<contents>..) group:

[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*

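To begin with, the only character we're interested in is the opening brace, so we can slurp up any other characters with [^{]*. Only after we see a { do we check whether it's the beginning of a {/block} tag. If it's not, we go ahead and consume it and start scanning for the next one, repeating as necessary.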
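
As a quick check of how the contents group behaves, the sketch below runs the one-line pattern against a shortened copy of the question's template. The $template value and the variable names are only for illustration.

```php
<?php
// Hypothetical input, shortened from the template in the question.
$template = '<ul>' . "\n" . '{block:sponsors}' . "\n"
          . '<li><a href="{var:url}">{var:name}</a>' . "\n"
          . '{block:test}This should be caught first!{/block:test}' . "\n"
          . '</li>' . "\n" . '{/block:sponsors}' . "\n" . '</ul>';

$pattern = '~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})'
         . '(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)'
         . '(?P<closing>\{/block:(?P=name)\})~i';

if (preg_match($pattern, $template, $m)) {
    echo $m['name'], "\n";      // name of the block that matched
    echo $m['closing'], "\n";   // its closing tag
}
```

With (?P=name) in the lookahead, only the closing tag of the same block stops the scan. Every other brace, including the nested {block:test} tags, is consumed as ordinary content, so this pattern matches the outer sponsors block.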

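Using RegexBuddy, I tested each regex by placing the cursor at the beginning of the {block:sponsors} tag and debugging. Then I removed the closing brace from the closing {/block:sponsors} tag to force a failed match and debugged it again. Your regex took 940 steps to succeed and 2265 steps to fail. Mine took 57 steps to succeed and 83 steps to fail.

On a side note, I removed the s modifier because I'm not using the dot (.), and the m modifier because it was never needed. I also used the named backreference (?P=name) instead of \3, following @DaveRandom's excellent suggestion. And I escaped all the braces ({ and }) because I find it easier to read that way.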

EDIT: If you want to match the innermost named block, change the middle portion of the regex from this:

(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)

...to this (as suggested by @Kobi in a comment):

(?P<contents>
  [^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*
)

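Originally, the (?P<opening>...) group would grab the first opening tag it saw, then the (?P<contents>..) group would consume anything, including other tags, as long as it wasn't the closing tag matching the one found by the (?P<opening>...) group. (Then the (?P<closing>...) group would go ahead and consume that.)

Now, the (?P<contents>...) group refuses to match any tag, opening or closing (note the /? at the beginning), no matter what the name is. So the regex initially starts to match the {block:sponsors} tag, but when it encounters the {block:test} tag, it abandons that match and goes back to searching for an opening tag. It starts again at the {block:test} tag, this time successfully completing the match when it finds the {/block:test} closing tag.

It sounds inefficient described like this, but it really isn't. The trick I described earlier, slurping up the non-braces, drowns out the effect of these false starts. Where you were doing a negative lookahead at almost every position, now you do one only when you encounter a {. You could even use possessive quantifiers, as @godspeedlee suggested: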

(?P<contents>
  [^{]*+(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*+)*+
)

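...because you know it will never consume anything that it has to give back later. That would speed things up a little, but it's not really necessary.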
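
Putting the innermost-block variant to work, here is a minimal sketch of the loop described in the question: match the innermost block, replace it, and repeat until no blocks are left. The renderBlocks() function, its $render callback and the sample template are hypothetical names introduced for this example; a real template engine would decide what each block expands to.

```php
<?php
// Innermost-block pattern with possessive quantifiers, in free-spacing (x) mode.
const INNERMOST_BLOCK = '~
  (?P<opening> \{ (?P<inverse>[!])? block: (?P<name>[a-z0-9\s_-]+) \} )
  (?P<contents> [^{]*+ (?: \{ (?! /?block: [a-z0-9\s_-]+ \} ) [^{]*+ )*+ )
  (?P<closing> \{ /block: (?P=name) \} )
~ix';

/**
 * Repeatedly replaces the innermost blocks with whatever $render returns.
 * $render receives the block name, its contents and the inverse flag.
 */
function renderBlocks(string $template, callable $render): string
{
    do {
        $template = preg_replace_callback(
            INNERMOST_BLOCK,
            function (array $m) use ($render) {
                return $render($m['name'], $m['contents'], $m['inverse'] === '!');
            },
            $template,
            -1,
            $count
        );
        if ($template === null) {
            // preg_replace_callback() returns null when PCRE reports an error.
            throw new RuntimeException('PCRE error code: ' . preg_last_error());
        }
    } while ($count > 0);

    return $template;
}

$order = [];
$output = renderBlocks(
    '{block:sponsors}<li>{var:name} {block:test}inner{/block:test}</li>{/block:sponsors}',
    function ($name, $contents, $inverse) use (&$order) {
        $order[] = $name;   // record the order in which blocks were processed
        return $contents;   // this sketch simply unwraps each block
    }
);

echo implode(', ', $order), "\n";
echo $output, "\n";
```

The loop only terminates if the callback's output does not itself contain new block tags; that is an assumption of this sketch.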

Answer 3

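Does the solution have to be a single regular expression? A more efficient approach might be simply to look for the first occurrence of {/block: (which could be a simple string search or a regex), then search backward from that point to find its matching opening tag, replace the span appropriately, and repeat until there are no more blocks. If every time you look for the first closing tag starting from the top of the template, that will give you the most deeply nested block.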
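
A minimal sketch of that first variant, assuming well-formed tags of the form {block:name}, {!block:name} and {/block:name}. The findInnermostBlock() function and the keys of the array it returns are hypothetical names chosen for this example.

```php
<?php
/**
 * Locates the innermost block by finding the first closing tag and then
 * searching backward for the nearest opening tag with the same name.
 * Returns null when no closing tag, or no matching opening tag, is found.
 */
function findInnermostBlock(string $template): ?array
{
    $closeStart = strpos($template, '{/block:');
    if ($closeStart === false) {
        return null;                        // no closing tags left
    }
    $nameStart = $closeStart + strlen('{/block:');
    $name = substr($template, $nameStart, strcspn($template, '}', $nameStart));

    // Search backward: the last opening tag before the closing tag.
    $before = substr($template, 0, $closeStart);
    $plain = strrpos($before, '{block:' . $name . '}');
    $inverted = strrpos($before, '{!block:' . $name . '}');
    if ($plain === false && $inverted === false) {
        return null;                        // unmatched closing tag
    }
    $inverse = $plain === false || ($inverted !== false && $inverted > $plain);
    $openStart = $inverse ? $inverted : $plain;
    $contentStart = strpos($template, '}', $openStart) + 1;

    return [
        'name'     => $name,
        'inverse'  => $inverse,
        'start'    => $openStart,                      // offset of the opening '{'
        'contents' => substr($template, $contentStart, $closeStart - $contentStart),
        'end'      => $nameStart + strlen($name) + 1,  // offset just past the closing '}'
    ];
}

$block = findInnermostBlock('{block:a}x{block:b}y{/block:b}z{/block:a}');
echo $block['name'], ': ', $block['contents'], "\n";
```

The caller can then replace the span from 'start' to 'end' and call the function again until it returns null.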

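The mirror-image algorithm would work just as well: look for the last opening tag, then search forward from there for the corresponding closing tag: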

<?php

$template = //...

while(true) {
  $last_open_tag = strrpos($template, '{block:');
  $last_inverted_tag = strrpos($template, '{!block:');
  // $block_start is the index of the '{' of the last opening block tag in the
  // template, or false if there are no more block tags left
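  // (max() alone is not enough: max(false, 0) returns false, which would miss
  // an inverted tag at offset 0)
  $block_start = ($last_open_tag === false || $last_inverted_tag === false)
      ? ($last_open_tag === false ? $last_inverted_tag : $last_open_tag)
      : max($last_open_tag, $last_inverted_tag);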
  if($block_start === false) {
    // all done
    break;
  } else {
    // extract the block name (the foo in {block:foo}) - from the character
    // after the next : to the character before the next }, inclusive
    $block_name_start = strpos($template, ':', $block_start) + 1;
    $block_name = substr($template, $block_name_start,
        strcspn($template, '}', $block_name_start));

    // we now have the start tag and the block name, next find the end tag.
    // $block_end is the index of the '{' of the next closing block tag after
    // $block_start.  If this doesn't match the opening tag something is wrong.
    $block_end = strpos($template, '{/block:', $block_start);
    if($block_end === false ||
        strpos($template, $block_name.'}', $block_end + 8) !== $block_end + 8) {
      // missing or non-matching closing tag
      print("Non-matching tag found\n");
      break;
    } else {
      // now we have found the innermost block
      // - its start tag begins at $block_start
      // - its content begins at
      //   (strpos($template, '}', $block_start) + 1)
      // - its content ends at $block_end
      // - its end tag ends at ($block_end + strlen($block_name) + 9)
      //   [9 being the length of '{/block:' plus '}']
      // - the start tag was inverted iff $block_start === $last_inverted_tag
      $template = // do whatever you need to do to replace the template
    }
  }
}

echo $template;
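
The replacement step above is left open. As one possible way to fill it in, the sketch below wraps the same last-opening-tag algorithm in a function and hands each innermost block to a callback. The replaceBlocks() function and its $render callback are hypothetical names for this example, and it assumes the callback's output contains no new block tags (otherwise the loop would not terminate).

```php
<?php
/**
 * Mirror-image variant: take the last opening tag, find the next closing tag
 * after it, and replace that block with the result of $render.
 * Throws if the closing tag is missing or names a different block.
 */
function replaceBlocks(string $template, callable $render): string
{
    while (true) {
        $lastOpen = strrpos($template, '{block:');
        $lastInverted = strrpos($template, '{!block:');
        if ($lastOpen === false && $lastInverted === false) {
            return $template;               // no block tags left
        }
        $inverse = $lastOpen === false
            || ($lastInverted !== false && $lastInverted > $lastOpen);
        $blockStart = $inverse ? $lastInverted : $lastOpen;

        // The block name runs from after the ':' up to the next '}'.
        $nameStart = strpos($template, ':', $blockStart) + 1;
        $name = substr($template, $nameStart, strcspn($template, '}', $nameStart));
        $contentStart = $nameStart + strlen($name) + 1;

        $closing = '{/block:' . $name . '}';
        $blockEnd = strpos($template, '{/block:', $blockStart);
        if ($blockEnd === false
            || substr_compare($template, $closing, $blockEnd, strlen($closing)) !== 0) {
            throw new RuntimeException("Non-matching tag for block '$name'");
        }

        // Replace everything from the opening '{' through the closing '}'.
        $contents = substr($template, $contentStart, $blockEnd - $contentStart);
        $template = substr_replace(
            $template,
            $render($name, $contents, $inverse),
            $blockStart,
            $blockEnd + strlen($closing) - $blockStart
        );
    }
}

$seen = [];
$result = replaceBlocks(
    '{block:a}x{!block:b}y{/block:b}z{/block:a}',
    function ($name, $contents, $inverse) use (&$seen) {
        $seen[] = [$name, $inverse];   // record each block as it is processed
        return $contents;              // this sketch simply unwraps the block
    }
);
echo $result, "\n";
```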


3 answers

Answer 1
4 2012-08-07 18:09:51

Answer 2
4 accepted 2012-08-08 02:52:55

Answer 3
4 2012-08-15 08:43:43
