
PHP's DOMElement->nodeValue returns gobbledygook

I'm parsing a third-party web page using PHP's DOM classes. When I open the page in my browser and view the source, it's clean, but when I access some of the nodes through the DOMElement->nodeValue property, the HTML tags aren't there, and there are several newlines and this character: Â. According to this answer, that character shows up when there's an encoding issue.

I also get that gobbledygook using:

  • simplexml_import_dom($node)->asXML();
  • $doc->saveXML($node);

My question is how I can simply get the clean HTML code inside the DOMElement?

Here is the clean HTML code:

<b>Author:</b> AUTHOR<br>
            <b>ISBN:</b> 9780684857220 <br>
            <b>Edition/Copyright:</b> 7<br>
            <b>Publisher:</b> J+M<br>
            <b>Published Date:</b>  1989<br>

Here is what nodeValue gives:

                    Â 
                    Author:Â AUTHOR      ISBN:Â 9780684857220 Edition/Copyright:Â 7     Publisher:Â J+M       Published Date:Â 
                    1989

Have you tried specifying the encoding when you create the DOM document? For example:

$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadXML($third_party_web_page_string);

or

$doc = new DOMDocument('1.0', 'iso-8859-1');
$doc->loadXML($third_party_web_page_string);

If neither of those works, you could try running the data through the iconv function before you load it into the DOM object.
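
As a rough illustration of the iconv suggestion, here is a minimal sketch. It assumes the fetched page is encoded as ISO-8859-1, which is only a guess, and the sample string is made up. It also shows where the Â can come from: a non-breaking space is the bytes C2 A0 in UTF-8, and reading those bytes as ISO-8859-1 displays the C2 as Â.

```php
<?php
// Hypothetical input: "Author:" followed by a non-breaking space,
// encoded as ISO-8859-1 (a single byte, A0).
$latin1 = "Author:\xA0AUTHOR";

// Convert the bytes to UTF-8 before handing them to the DOM parser.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8 the non-breaking space is the two-byte sequence C2 A0.
echo bin2hex(substr($utf8, 7, 2)), "\n"; // c2a0

// Treating those UTF-8 bytes as ISO-8859-1 once more turns C2 into "Â",
// which reproduces the "Author:Â AUTHOR" symptom from the question.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo $misread, "\n";
```

If the page is already UTF-8, converting it again would produce exactly this kind of double encoding, so it is worth checking the page's declared charset first.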

Turns out it wasn't an encoding issue but rather I was using the wrong methods. This works:

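// $second_td is the DOMElement (a <td>) taken from the parsed page.
$doc = new DOMDocument();
// Deep-copy the node into a fresh document, then serialize it as HTML.
$doc->appendChild($doc->importNode($second_td, true));
echo $doc->saveHTML();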
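
For anyone who wants to try this end to end, here is a self-contained sketch of the same approach. The sample markup and the helper name nodeToHtml are made up for illustration; only the importNode/saveHTML steps come from the snippet above.

```php
<?php
// Hypothetical stand-in for the third-party page.
$html = '<table><tr><td>Details</td>'
      . '<td><b>Author:</b> AUTHOR<br><b>ISBN:</b> 9780684857220 <br></td></tr></table>';

$page = new DOMDocument();
$page->loadHTML($html);

// Pick the second <td>, mirroring $second_td in the snippet above.
$second_td = $page->getElementsByTagName('td')->item(1);

// Hypothetical helper: copy a node into an empty document and
// serialize that document, so only the node's own markup is returned.
function nodeToHtml(DOMNode $node): string
{
    $doc = new DOMDocument();
    $doc->appendChild($doc->importNode($node, true)); // true = deep copy
    return (string) $doc->saveHTML();
}

$markup = nodeToHtml($second_td);
echo $markup;

// nodeValue, by contrast, is only the concatenated text content,
// which is why the tags were missing in the question.
echo $second_td->nodeValue, "\n";
```

On PHP 5.3.6 and later, saveHTML() also accepts a node argument, so $page->saveHTML($second_td) should give the same markup without the intermediate document.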
