Rogue element when parsing HTML with DOMDocument

Suppose my $html looks like this:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type="text/javascript" src="/gui/default/tinymcecontent.js"></script>
    <script type="text/javascript" src="/includes/js/video-js/video.min.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body style="font-family: arial;font-size: 12px;">
    <p> </p>
    <table width="100%">        
    </table>
</body>
</html>

When I try to parse only the elements inside the body tag with the following code:

$dom = new DOMDocument();

libxml_use_internal_errors(true);
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);

$full_dom = $dom->getElementsByTagName('body')->item(0);

The result of

$dom->saveHTML($full_dom)

is:

<body>\n<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>\n<p>\u00a0<\/p>\n<table width=\"100%\"><\/table>\n<\/body>

Where does the element

<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>

come from? Everything else is fine, except that this element has moved from the head tag into the body tag.

It comes from the following line:

<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>

It is malformed (the "=" after type is missing and the closing tag is written as </script/>). It should be:

<script type="text/javascript" src="/includes/js/video-js/video.js"></script>
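One way to confirm that this line is the culprit is to parse a trimmed-down document containing the corrected tag and check that no <link> ends up inside the body. This is a minimal sketch: the helper name bodyHtml() and the shortened sample markup are illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: parse $html and return the serialized <body> element.
function bodyHtml(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of emitting them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : (string) $dom->saveHTML($body);
}

// Shortened sample with the corrected <script> tag (assumed for illustration).
$fixed = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript" src="/includes/js/video-js/video.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body>
    <table width="100%"></table>
</body>
</html>
HTML;

$out = bodyHtml($fixed);

// With well-formed markup the <link> should stay in <head>.
var_dump(strpos($out, '<link') === false);
```

Swapping the malformed `</script/>` line back in should reproduce the stray paragraph on libxml2 builds that recover the way shown in the question, although the exact recovery behaviour may differ between libxml2 versions.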

To see what happened, check the parser errors after calling $dom->loadHTML():

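// Run this right after loadHTML() and before libxml_use_internal_errors(false),
// because disabling internal error handling also clears the collected errors.
foreach (libxml_get_errors() as $error) {
    print_r($error);
}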
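Putting the pieces together, the error check could be wrapped as follows. This is only a sketch under stated assumptions: htmlParseErrors() is a hypothetical helper name, and the one-line sample document is a reduced version of the markup from the question. The number and wording of the reported errors depend on the installed libxml2 version.

```php
<?php
// Hypothetical helper: load $html and return the libxml errors it produced.
function htmlParseErrors(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();                 // start from an empty error buffer

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $errors = libxml_get_errors();         // read the errors BEFORE clearing or disabling
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    return $errors;
}

// Reduced sample containing the malformed <script> line from the question.
$snippet = '<html><head>'
    . '<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>'
    . '</head><body></body></html>';

foreach (htmlParseErrors($snippet) as $error) {
    // Each entry is a LibXMLError object with line, column, code and message.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

Printing the line and column instead of dumping the whole object with print_r() makes it easier to map each message back to the offending tag.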
