
PHP: Auto-generated XML carriage return entities appear w/ SimpleXML and xpath

I'm using SimpleXML and xpath to read elements from an external UTF-8 XHTML document. I then iteratively echo the output of SimpleXML's asXML() function executed upon each element returned from an xpath selector. But the XML carriage return entity (&#13;) is annoyingly inserted after every line of my code. There aren't any extra characters in the XHTML document. What is causing this? It seems to be the way I'm iterating through each array element returned from xpath. I don't get the entities when I'm just outputting one element directly from SimpleXML's asXML() (without using xpath).

<?php
$content = new DOMDocument();
$content->loadHTMLFile('CONTENT.html');
$story = simplexml_import_dom($content->getElementById('story'));
$topics = $story->xpath('div[@class="topic"]');
foreach ($topics as $topic) {
    $topicContents = $topic->xpath('div/child::node()'); // Array of elements within 'content'.
    foreach ($topicContents as $contentElement) {
        echo $contentElement->asXML();
    }
}
?>

Excerpt from outputted XHTML code with auto-generated XML carriage returns:

<div class="content">&#13;
<p>Lorem ipsum dolor sit amet</p>&#13;
<h2>Lorem ipsum</h2>&#13;
<p>Lorem ipsum dolor sit amet</p>&#13;
<ul>
    <li>Lorem ipsum</li>&#13;
    <li>Lorem ipsum</li>&#13;
    <li>Lorem ipsum</li>&#13;

That's how libxml treats \r in text nodes.

<?php
$xml = <<< XML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
    <head>
        <title>...</title>
    </head>
    <body><pre>a\nb\r\nc</pre></body>
</html>
XML;
$content = new DOMDocument();
$content->loadHTML($xml);
$content = simplexml_import_dom($content);
echo $content->asXML();
prints
<html lang="en"><head><title>...</title></head><body><pre>a\nb&#13;\nc</pre></body></html>
(the \n characters are "left alone" while the \r\n is handled as &#13;\n)
I'm not an XML expert, but I think that according to http://www.w3.org/TR/REC-xml/#sec-line-ends:
To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
it should treat the \r\n as a single \n, but it doesn't.
If it doesn't cause you serious trouble, just live with it...
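
If the entities do get in the way, one possible workaround is to normalize the line endings yourself before the markup reaches DOMDocument, so that no \r ever ends up in a text node. The following is only a minimal sketch: normalizeLineEndings() is a hypothetical helper name (not part of PHP or libxml), and the inline $html string stands in for the contents of the real file. Presumably the normalization quoted from the spec is skipped because loadHTML() uses libxml's HTML parser rather than its XML parser, but that is an assumption on my part.

```php
<?php
// Hypothetical helper (not a PHP built-in): convert CRLF and lone CR to LF,
// mirroring the line-end normalization described in the XML spec.
function normalizeLineEndings(string $markup): string
{
    // Replacements run in order: "\r\n" first, then any remaining lone "\r".
    return str_replace(["\r\n", "\r"], "\n", $markup);
}

// Stand-in for the external document; the double quotes make "\r\n" real characters.
$html = "<html><body><pre>a\nb\r\nc</pre></body></html>";

$doc = new DOMDocument();
$doc->loadHTML(normalizeLineEndings($html));

// Same route as in the question: hand the DOM over to SimpleXML and serialize.
$pre = simplexml_import_dom($doc)->body->pre;
$out = $pre->asXML();

echo $out;
```

For a file on disk, the same idea would be to read it with file_get_contents(), pass the string through the helper, and call loadHTML() instead of loadHTMLFile(). Alternatively, the entity could be removed on the way out with str_replace('&#13;', '', $element->asXML()), at the cost of also dropping any carriage return that was intended.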
