
DOMDocument remove script tags from HTML source

I used @Alex's approach here to remove script tags from an HTML document using the built-in DOMDocument. The problem is that if the document has one script tag with inline JavaScript and another script tag that links to an external JavaScript source file, not all script tags are removed from the HTML.

$result = '
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
        <script>
            alert("hello");
        </script>
    </head>
    <body>hey</body>
</html>
';

$dom = new DOMDocument();
if($dom->loadHTML($result))
{
    $script_tags = $dom->getElementsByTagName('script');

    $length = $script_tags->length;

    for ($i = 0; $i < $length; $i++) {
        if(is_object($script_tags->item($i)->parentNode)) {
            $script_tags->item($i)->parentNode->removeChild($script_tags->item($i));
        }
    }

    echo $dom->saveHTML();
}

The above code outputs:

<html>
    <head>
        <meta charset="utf-8">
        <title>hey</title>
        <script>
        alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

As you can see from the output, only the external script tag was removed. Is there anything I can do to ensure all script tags are removed?

Your error is actually trivial. The DOMNodeList returned by getElementsByTagName() is live: it is updated automatically whenever the document changes, most notably when matching nodes are removed. (DOMNodeList is not itself a DOMNode, but lists obtained from a node, such as childNodes, behave the same way.) This is written on a couple of lines in the PHP docs, but is mostly swept under the carpet.

If you loop up to the list's length and remove elements from the document as you go, you'll notice that the length property actually changes! In your code, removing item 0 turns the second script tag into the new item 0, so item(1) returns null on the next iteration and that tag is never removed. I had to write my own library to counteract this and a few other quirks.
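To make the live-list behaviour concrete, here is a minimal sketch. The markup and the `a.js` file name are made up for illustration; only the DOMDocument calls already used in the question are involved.

```php
<?php
// Sketch: the DOMNodeList from getElementsByTagName() tracks the document.
// The HTML below (including "a.js") is a made-up example.
$html = '<html><head><script src="a.js"></script>'
      . '<script>alert("hello");</script></head><body>hey</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$scripts = $dom->getElementsByTagName('script');
$before  = $scripts->length;          // both script tags are in the list

// Remove the first script tag from the document.
$first = $scripts->item(0);
$first->parentNode->removeChild($first);

$after   = $scripts->length;          // same list object, queried again
$missing = $scripts->item(1);         // index 1 no longer exists

echo "before=$before after=$after\n"; // before=2 after=1
var_dump($missing === null);          // bool(true)
```

This is exactly what happens in the question's for loop: `$length` was captured as 2, but by the time `$i` reaches 1 the list only has one entry left.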

The solution:

if($dom->loadHTML($result))
{
    // Query the list again on every pass and always remove the first remaining match.
    while (($r = $dom->getElementsByTagName("script")) && $r->length) {
        $r->item(0)->parentNode->removeChild($r->item(0));
    }
    echo $dom->saveHTML();
}

I'm not actually iterating by index, just popping the first element off the list one at a time. The result: http://sebrenauld.co.uk/domremovescript.php
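A related variation, which is not from the answer above, is to walk the live list from the end: removing the last item never shifts the indexes that are still to be visited. The helper name `removeTagsBackwards` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper (not part of the answer above): remove every element
// with the given tag name by walking the live list from the end.
function removeTagsBackwards(DOMDocument $dom, string $tagName): int
{
    $nodes   = $dom->getElementsByTagName($tagName);
    $removed = 0;

    // Going backwards, each removal only affects indexes already visited.
    for ($i = $nodes->length - 1; $i >= 0; $i--) {
        $node = $nodes->item($i);
        $node->parentNode->removeChild($node);
        $removed++;
    }

    return $removed;
}

// Made-up sample document; "a.js" is a placeholder file name.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>hey</title><script src="a.js"></script>'
    . '<script>alert("hello");</script></head><body>hey</body></html>'
);

$removed   = removeTagsBackwards($dom, 'script');
$remaining = $dom->getElementsByTagName('script')->length;

echo "removed=$removed remaining=$remaining\n"; // removed=2 remaining=0
```

Whether this or the while loop reads better is a matter of taste; both avoid relying on a length that was captured before any node was removed.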

To avoid the surprises of a live node list, which gets shorter as you delete nodes, you can work on a copy of it in an array, made with iterator_to_array:

// $tag is the tag name to remove, e.g. 'script'
foreach(iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
    $node->parentNode->removeChild($node);
}
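Put together with the HTML from the question, the snapshot approach might look like the sketch below. The function name `removeElementsByTag` is an assumption made for this example, not something from the answer.

```php
<?php
// Hypothetical wrapper around the snippet above: snapshot the live list
// into an array first, then remove each node from the document.
function removeElementsByTag(DOMDocument $dom, string $tag): void
{
    // The array holds the node references and does not shrink during removal.
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Same document as in the question.
$result = '
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
        <script>
            alert("hello");
        </script>
    </head>
    <body>hey</body>
</html>
';

$output = '';
$dom = new DOMDocument();
if ($dom->loadHTML($result)) {
    removeElementsByTag($dom, 'script');
    $output = $dom->saveHTML();
    echo $output; // neither script tag should appear in the output
}
```

Passing the tag name as a parameter keeps the helper usable for other elements, such as style or iframe, under the same assumptions.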


 