
PHP 5.4.16 DOMDocument removes parts of JavaScript

I am trying to load an HTML page from a remote server into a PHP script that manipulates the HTML with the DOMDocument class. However, DOMDocument removes parts of the JavaScript that comes with the page. The page contains code such as:

<script type="text/javascript">
//...
function printJSPage() {
    var printwin=window.open('','haha','top=100,left=100,width=800,height=600');
    printwin.document.writeln(' <table border="0" cellspacing="5" cellpadding="0" width="100%">');
    printwin.document.writeln(' <tr>');
    printwin.document.writeln(' <td align="left" valign="bottom">');
    //...
    printwin.document.writeln('</td>');
    //...
}
</script>

But DOMDocument changes, for example, the line

printwin.document.writeln('</td>');

to

printwin.document.writeln(' ');

and many other things as well (for example, the last script tag is no longer there). As a result I get a completely broken page, which I cannot pass on.

So I think DOMDocument has problems with the HTML tags inside the JavaScript code and tries to correct them in order to produce a well-formed document. Can I prevent DOMDocument from parsing the JavaScript?

The PHP code fragment is:

$stdin = file_get_contents('php://stdin');
$dom = new \DOMDocument();
@$dom->loadHTML($stdin);
return $dom->saveHTML();   // will produce wrong HTML
//return $stdin;           // will produce correct HTML

I stored both HTML versions and compared them with Meld.

I have also tested

@$dom->loadXML($stdin);
return $dom->saveHTML();

but I don't get anything back from the object.

Here's a hack that might be helpful. The idea is to replace the script contents with a string that is guaranteed to be valid HTML and unique, and then swap the original contents back in afterwards.

It replaces the contents of every script tag with the MD5 hash of those contents, then restores them once DOMDocument has serialized the document.

$scriptContainer = [];
// Swap each script body for its MD5 hash and remember hash => original body.
$str = preg_replace_callback("#<script([^>]*)>(.*?)</script>#s", function ($matches) use (&$scriptContainer) {
    $scriptContainer[md5($matches[2])] = $matches[2];
    return "<script" . $matches[1] . ">" . md5($matches[2]) . "</script>";
}, $str);
$dom = new \DOMDocument();
@$dom->loadHTML($str);
// Put the original script bodies back in place of the hashes.
$final = strtr($dom->saveHTML(), $scriptContainer);

strtr is simply convenient here because the array is already keyed by placeholder; str_replace(array_keys($scriptContainer), $scriptContainer, $dom->saveHTML()) would also work.
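To try the round trip end to end, the hack can be wrapped in a function and run against a small sample page. This is only a sketch: the helper name maskScripts and the sample markup are made up for illustration, and the i flag is added on the assumption that upper-case <SCRIPT> tags should be caught as well.

```php
<?php
// Hypothetical helper (not part of the original answer): replaces the body of
// every <script> element with its MD5 hash and records hash => body in $store.
function maskScripts($html, array &$store)
{
    return preg_replace_callback(
        '#<script([^>]*)>(.*?)</script>#si',
        function ($m) use (&$store) {
            $hash = md5($m[2]);
            $store[$hash] = $m[2];
            return '<script' . $m[1] . '>' . $hash . '</script>';
        },
        $html
    );
}

// Invented sample input containing markup inside JavaScript string literals.
$html = <<<'HTML'
<html><head><title>Demo</title>
<script type="text/javascript">
function printJSPage() {
    printwin.document.writeln('<td align="left">');
    printwin.document.writeln('</td>');
}
</script>
</head><body><p>Hello</p></body></html>
HTML;

$store = [];
$masked = maskScripts($html, $store);

// Buffer parser warnings instead of suppressing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($masked);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Restore the original script bodies in the serialized document.
$final = strtr($dom->saveHTML(), $store);
```

After this runs, $masked contains only the hash inside the script element, and $final should again contain the original writeln('</td>') line.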

I find it very surprising that PHP does not parse HTML content properly. It seems to be parsing it as XML instead (and wrongly at that, because CDATA content is parsed rather than treated literally). However, it is what it is, and if you want a real document parser you should probably look into a Node.js solution with jsdom.
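To see what the parser actually objects to, libxml's error buffer can be inspected instead of silencing everything with @. The following is a minimal diagnostic sketch with an invented one-line page. Which messages appear, and whether the script body survives at all, depends on the libxml2 version PHP was built against, so no particular output is assumed here.

```php
<?php
// Invented sample: a closing tag inside a JavaScript string literal.
$html = '<html><head><script>document.writeln("</td>");</script></head><body></body></html>';

// Collect libxml errors in a buffer rather than emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();   // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Check whether the script body came through serialization unchanged.
$roundTripped = $dom->saveHTML();
$scriptSurvived = strpos($roundTripped, 'document.writeln("</td>");') !== false;
echo $scriptSurvived ? "script body intact\n" : "script body was altered\n";
```

If the last line reports an altered body, the masking approach is needed on that installation.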

If you have a <script> within a <script>, the following (not so smart) solution will handle that. One problem remains: if the <script> tags are not balanced, the solution will not work. This can happen, for example, if your JavaScript uses String.fromCharCode to print the string </script>.

$scriptContainer = array();

function getPosition($tag) {
    return $tag[0][1];
}

function getContent($tag) {
    return $tag[0][0];
}

function isStart($tag) {
    $x = getContent($tag);
    return ($x[0].$x[1] === "<s");
}

function isEnd($tag) {
    $x = getContent($tag);
    return ($x[0].$x[1] === "</");
}

function mask($str, $scripts) {
    global $scriptContainer;

    $res = "";
    $start = null;
    $stop = null;
    $idx = 0;

    $count = 0;
    foreach ($scripts as $tag) {

        if (isStart($tag)) {
            $count++;
            $start = ($start === null) ? $tag : $start;
        }

        if (isEnd($tag)) {
            $count--;
            $stop = ($count == 0) ? $tag : $stop;
        }

        if ($start !== null && $stop !== null) {
            $res .= substr($str, $idx, getPosition($start) - $idx);
            $res .= getContent($start);
            $code = substr($str, getPosition($start) + strlen(getContent($start)), getPosition($stop) - getPosition($start) - strlen(getContent($start)));
            $hash = md5($code);
            $res .= $hash;
            $res .= getContent($stop);

            $scriptContainer[$hash] = $code;

            $idx = getPosition($stop) + strlen(getContent($stop));
            $start = null;
            $stop = null;
        }
    }

    $res .= substr($str, $idx);
    return $res;
}

preg_match_all("#\<script[^\>]*\>|\<\/script\>#s", $html, $scripts, PREG_OFFSET_CAPTURE|PREG_SET_ORDER);
$html = mask($html, $scripts);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_use_internal_errors(false);

// handle some things within DOM

echo strtr($dom->saveHTML(), $scriptContainer);
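Because this approach only works when opening and closing tags are balanced, it may be worth checking that before masking. A rough sketch of such a check follows; tagsAreBalanced is a hypothetical helper name, and it matches tags purely textually, just like the masking code does.

```php
<?php
// Hypothetical helper: returns true when every textual opening tag of
// $tagName has a matching closing tag, in order.
function tagsAreBalanced($html, $tagName)
{
    $name = preg_quote($tagName, '#');
    preg_match_all("#<{$name}\\b[^>]*>|</{$name}>#si", $html, $tags, PREG_SET_ORDER);

    $depth = 0;
    foreach ($tags as $tag) {
        // $tag[0] is the matched tag text; a second character of "/" marks a closing tag.
        $depth += ($tag[0][1] === '/') ? -1 : 1;
        if ($depth < 0) {
            return false;   // a closing tag appeared before any opener
        }
    }
    return $depth === 0;
}

$balanced = tagsAreBalanced('<script>a();</script>', 'script');
$unbalanced = tagsAreBalanced("<script>document.write('<script>');</script>", 'script');
var_dump($balanced, $unbalanced);
```

The second sample contains two opening tags but only one closing tag, which is the situation in which the masking loop never finds its end tag.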

If you replace the string "script" in the preg_match_all pattern with "style", you can mask CSS styles as well; these can also contain tag names (for example, within comments).
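One way to make that substitution without editing the pattern by hand is to build it from the tag name. The sketch below assumes a hypothetical helper tagPattern and an invented sample string; it also shows the shape of the matches that getContent() and getPosition() read when PREG_OFFSET_CAPTURE and PREG_SET_ORDER are combined.

```php
<?php
// Hypothetical helper: builds the open/close tag pattern for a given element name.
function tagPattern($tagName)
{
    $name = preg_quote($tagName, '#');
    return "#<{$name}[^>]*>|</{$name}>#si";
}

// Invented sample: a CSS comment that mentions a tag name.
$html = '<style>/* <td> */ p { color: red; }</style><p>x</p>';
preg_match_all(tagPattern('style'), $html, $styles, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

foreach ($styles as $tag) {
    // $tag[0][0] is the matched tag text, $tag[0][1] its byte offset in $html.
    printf("%s at offset %d\n", $tag[0][0], $tag[0][1]);
}
```

The resulting $styles array can be passed to mask() in place of $scripts.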
