Removing HTML tags in Flex / AS3 with regular expressions

Question

I am writing an HTML parser in Flex (AS3) and need to remove some unwanted HTML tags.

For example, I want to strip the redundant divs from the following code:

           <div>
              <div>
                <div>
                  <div>
                    <div>
                      <div>
                        <div>
                          <p style="padding-left: 18px; padding-right: 20px; text-align: center;">
                            <span></span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: bold; text-decoration: none; font-family: Arial;">20% OFF.</span>
                            <span> </span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: normal; text-decoration: none; font-family: Arial;">Do it NOW!</span>
                            <span> </span>
                          </p>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>

so that it ends up like this:

                      <div>
                          <p style="padding-left: 18px; padding-right: 20px; text-align: center;">
                            <span></span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: bold; text-decoration: none; font-family: Arial;">20% OFF.</span>
                            <span> </span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: normal; text-decoration: none; font-family: Arial;">Do it NOW!</span>
                            <span> </span>
                          </p>
                        </div>

My question is: how do I write a regular expression that removes these unwanted DIVs? Is there a better way?

Thanks in advance.

Answer 1

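You can't match arbitrarily nested structures with a regular expression, because nesting makes the language non-regular. A parser (which is what you are writing) is the right tool.

Now, in this very special case, you could do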

result = subject.replace(/^\s*(<\/?div>)(?:\s*\1)*(?=\s*\1)/mg, "");

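(this removes every directly consecutive occurrence of <div> or </div> except the last one in each run), but it is bad in so many ways that I'm afraid I'll be downvoted for suggesting it.

Explanation: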

^            # match start of line
\s*          # match leading whitespace
(<\/?div>)   # match a <div> or </div> and capture it
(?:\s*\1)*   # match any further occurrences of that same tag
(?=\s*\1)    # as long as one more occurrence follows right after

Can you count the ways this can fail? (Think of comments, unmatched <div>s, and so on.)
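
To make that question concrete, here is a minimal sketch that runs the same pattern on two inputs. It is written in TypeScript rather than AS3 so it can be run standalone; both use ECMAScript-style regex literals, but the AS3 behaviour is assumed, not verified here. `collapseDivs`, `countOccurrences` and the sample strings are hypothetical names made up for illustration.

```typescript
// The pattern from the answer above, unchanged.
const COLLAPSE_DIVS: RegExp = /^\s*(<\/?div>)(?:\s*\1)*(?=\s*\1)/mg;

// Hypothetical helper: drop every run of identical bare div tags except the last one.
function collapseDivs(html: string): string {
  return html.replace(COLLAPSE_DIVS, "");
}

// Hypothetical helper: count non-overlapping occurrences of a literal substring.
function countOccurrences(haystack: string, needle: string): number {
  return haystack.split(needle).length - 1;
}

// Case 1: pure wrapper nesting, as in the question.
const wrapped: string = [
  "<div>",
  "  <div>",
  "    <div>",
  "      <p>x</p>",
  "    </div>",
  "  </div>",
  "</div>",
].join("\n");
const collapsed: string = collapseDivs(wrapped);
console.log(countOccurrences(collapsed, "<div>"), countOccurrences(collapsed, "</div>")); // 1 1

// Case 2: the inner div has a sibling, so the closing tags are not adjacent.
const withSibling: string = [
  "<div>",
  "<div>",
  "<p>a</p>",
  "</div>",
  "<p>b</p>",
  "</div>",
].join("\n");
const broken: string = collapseDivs(withSibling);
console.log(countOccurrences(broken, "<div>"), countOccurrences(broken, "</div>")); // 1 2
```

The first input comes out balanced. The second loses an opening tag but keeps both closing tags, because only the opening tags were adjacent; a div carrying any attribute would not be matched at all.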

Answer 2

Assuming your target HTML is actually valid XML, you can use a recursive function to pull out the non-div parts.

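// Returns the contents of xml with every <div> wrapper replaced by its own
// (recursively flattened) children; all other nodes are kept unchanged.
static function grabNonDivContents(xml:XML):XMLList {
    var out:XMLList = new XMLList();
    var kids:XMLList = xml.children();
    for each (var kid:XML in kids) {
        // name() is null for text nodes, so check it before comparing.
        if (kid.name() && kid.name() == "div") {
            // Splice in the div's contents instead of the div itself.
            var grandkids:XMLList = grabNonDivContents(kid);
            for each (var grandkid:XML in grandkids) {
                out += grandkid;
            }
        } else {
            out += kid;
        }
    }
    return out;
}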
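
Note that this function strips every div, while the question wants one div kept, so the caller would wrap the result in a single div. The sketch below shows that step using the same recursion. It is TypeScript on a hypothetical hand-rolled node type, not E4X, so `HtmlNode`, `element`, `textNode` and `render` are all assumptions made for illustration.

```typescript
// Hypothetical minimal tree: a text node, or an element with children.
type HtmlNode =
  | { kind: "text"; text: string }
  | { kind: "element"; name: string; children: HtmlNode[] };

// Small constructors so the sample tree stays readable.
function element(name: string, ...children: HtmlNode[]): HtmlNode {
  return { kind: "element", name, children };
}
function textNode(text: string): HtmlNode {
  return { kind: "text", text };
}

// Same idea as the AS3 function: each div child is replaced by its own
// (recursively flattened) contents, every other child is kept as-is.
function grabNonDivContents(node: HtmlNode): HtmlNode[] {
  if (node.kind === "text") return [];
  const out: HtmlNode[] = [];
  for (const kid of node.children) {
    if (kid.kind === "element" && kid.name === "div") {
      out.push(...grabNonDivContents(kid));
    } else {
      out.push(kid);
    }
  }
  return out;
}

// Serialize without attributes or escaping, just enough to inspect the result.
function render(node: HtmlNode): string {
  if (node.kind === "text") return node.text;
  return `<${node.name}>${node.children.map(render).join("")}</${node.name}>`;
}

const nestedTree: HtmlNode = element("div", element("div", element("div",
  element("p", textNode("20% OFF.")))));

// Wrap the flattened contents in exactly one div, as the question asks.
const flattened: HtmlNode = element("div", ...grabNonDivContents(nestedTree));
console.log(render(flattened)); // <div><p>20% OFF.</p></div>
```

As in the AS3 version, a div nested inside a non-div element is left alone, because only div children are descended into.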

Answer 3

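In my experience, parsing complex HTML with regexes alone is hell. The regular expressions quickly get out of hand. It is much more robust to extract the information you need (perhaps with simple regexes) and reassemble it into a simpler document.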
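
One way to read that advice for the question's input is to pull out each paragraph with a simple pattern and rebuild the document around it. This is a minimal sketch in TypeScript (not AS3), assuming the only content worth keeping is the `<p>` elements and that no attribute value contains `>`; `rebuildWithSingleDiv` and the sample string are hypothetical.

```typescript
// Hypothetical helper: keep only the <p>...</p> blocks and wrap them in one div.
function rebuildWithSingleDiv(html: string): string {
  // \b stops the pattern from matching <pre>; the lazy [\s\S]*? ends at the first </p>.
  const paragraphs: string[] = html.match(/<p\b[^>]*>[\s\S]*?<\/p>/g) || [];
  return "<div>\n" + paragraphs.join("\n") + "\n</div>";
}

const messy: string =
  "<div><div><p>20% OFF.</p></div><div><p>Do it NOW!</p></div></div>";
const rebuilt: string = rebuildWithSingleDiv(messy);
console.log(rebuilt);
```

The pattern never has to reason about nesting, since `<p>` elements cannot contain other `<p>` elements in valid HTML; anything outside a paragraph is dropped on purpose.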

Answer 1: 2 votes, posted 2010-09-26 09:19:59
Answer 2: 1 vote, accepted, posted 2010-09-27 06:06:35
Answer 3: 0 votes, posted 2010-09-26 09:54:10
