
PHP saveHTML function is not saving HTML properly

I have been trying to save the source code of a section of a webpage using PHP. When I extract the content of the whole webpage, the source code order is preserved, but when I try to get part of the document using

$dom = new DOMDocument;
$dom->loadHTML($webpage);
$xpath = new DOMXPath($dom);

$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));

the script tag gets messed up. So far, this is the only website where this issue has occurred. Are there limitations of the saveHTML function that I am not aware of?

This is what I should be receiving:

<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
        var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}            
        $('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
                     $('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onClick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96" /></a>');
                                   $('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
         $('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
         $('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);


});</script> </div>

This is what I actually get:

<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
        var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}            
        $('#sponsored-category-header').append('<div class="sponsored-category-logo"></script>


</div>');
                     $('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onclick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96"></a>');
                                   $('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
         $('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
         $('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);


    }); </div>

In case you missed it, the ending script tag has moved up a few lines.

Just to be clear, I am not talking about rendered HTML. I am talking about the actual source code that I get after making the request. Any help on how to resolve this issue will be appreciated.

I know that the function saveHTML is causing the issue because when I echo the whole page through PHP, every tag is in the right place.

First of all, your code should be triggering a good bunch of warnings like these:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity
Warning: DOMDocument::loadHTML(): Unexpected end tag : strong in Entity
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity

This is to be expected with in-the-wild HTML (and this page's code is not particularly bad), but you haven't even mentioned it, which makes me suspect that you might not have error reporting enabled on your development box.
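If you would rather inspect those parser messages than have them printed as warnings, libxml can buffer them for you. The following is a minimal sketch; the sample markup in $webpage is a made-up stand-in for the real page, and the exact messages you get depend on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical sample markup: an unescaped ampersand and a stray end tag.
$webpage = '<html><body><p>Fish & chips</strong></p></body></html>';

// Buffer libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($webpage);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer so later parses start clean.
libxml_clear_errors();
```

During development it also helps to set error_reporting(E_ALL) and turn display_errors on, so that warnings like the ones above are not silently discarded.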

Additionally, the page has huge amounts of JavaScript and DOMDocument is just an HTML parser.

With that, we can get a clear picture of what's happening. Since DOMDocument is not a full-fledged browser, it doesn't understand JavaScript code. That means it detects the <script> tag but doesn't handle its contents as JavaScript; it merely looks for a closing tag, and the first one it finds is this:

$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
                                                                             ^^^^^^

It doesn't know that it's a JavaScript string and should be ignored. Instead, it thinks the wrong tag is being closed so it attempts to fix what's technically invalid HTML and adds the missing </script> tag.

For this precise reason, the <script>...</script> tag set has traditionally been written this way:

<script type="text/javascript"><!--
var foo = '<p>Escaped end tag<\/p>';
//--></script>

... so user agents that are unaware of JavaScript can safely ignore the whole tag (hey, it's nothing but a good old HTML comment). However, nowadays it's almost universally considered bad practice because "all browsers understand JavaScript".
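When the page is not yours to fix, the same escaping idea can be applied to the downloaded source before it reaches loadHTML. The sketch below is one possible workaround rather than a feature of the DOM extension: escapeScriptEndTags is a hypothetical helper, $webpage is a reduced made-up sample, and the regular expression assumes the JavaScript never contains a literal </script> inside a string.

```php
<?php
// Hypothetical helper: rewrite "</" as "<\/" inside every <script> block,
// so the HTML parser sees no end tag before the real </script>.
function escapeScriptEndTags(string $html): string
{
    $escaped = preg_replace_callback(
        '#(<script\b[^>]*>)(.*?)(</script>)#is',
        function (array $m): string {
            // $m[1] = opening tag, $m[2] = script body, $m[3] = closing tag
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $escaped ?? $html;
}

// Hypothetical sample input, reduced from the markup in the question.
$webpage = <<<'HTML'
<html><body><div class="class-name"><script>
$('#x').append('<div class="logo"></div>');
</script></div></body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML(escapeScriptEndTags($webpage));
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$node = $xpath->query("//div[contains(@class, 'class-name')]")->item(0);
$result = $dom->saveHTML($node);

echo $result, "\n";
```

Inside a JavaScript string, <\/ and </ mean the same thing, so the script keeps working. The saved source will still differ from the original by those added backslashes.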

Final note: the DOM extension is probably aware of the <script> tag and knows it isn't allowed to have other tags inside. That explains why inner opening tags are not considered.

