
Unexpected escaped CRs inserted with XSLT output-method=“text” transform

My question is what logic might explain the following behavior or, if it is a bug (in MSXML6 under Windows), what failure of logic could underlie such a bug.

Consider the following input XML file.

<?xml version="1.0" encoding="utf-8"?>
<root>
    <item>first item</item>
    <item>second item</item>
</root>

The following XSLT attempts to extract the items in text format, one per line, with the standard Windows CR-LF line endings.

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE xsl:stylesheet [<!ENTITY eol "<![CDATA[&#xD;&#xA;]]>">]> <!-- (a) !?? -->

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" version="1.0" encoding="utf-8" media-type="text/plain"/>
<xsl:strip-space elements='*'/>
<xsl:template match="item"> <!-- list items, one per line -->
    <xsl:value-of select="."/>
    <xsl:text disable-output-escaping="yes">&eol;</xsl:text>
</xsl:template>
</xsl:stylesheet>

However, the output that I am getting includes extraneous escaped CRs literally output as "&#13;" at the end of each line.

first item&#13;
second item&#13;

The question, again, is about the particular behavior above, which I find quite odd. I am specifically not asking for alternatives or workarounds. In fact, several variations of the entity declaration appear to work fine, as noted in the comments below.

<!DOCTYPE xsl:stylesheet [<!ENTITY eol "<![CDATA[&#xA;]]>">]> <!-- (b) works  -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "&amp;#xA;">]>         <!-- (c) no newlines in output -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "&#x26;#xA;">]>        <!-- (d) works  -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "&#xA;">]>             <!-- (e) no newlines in output -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "&#xD;&#xA;">]>        <!-- (f) works  -->


[ EDIT ] The following minimal JScript code reproduces the issue.

var vArgs = WScript.Arguments;
var xmlFile = vArgs(0);
var xslFile = vArgs(1);

var xmlDOMDocProgID = "MSXML2.DOMDocument.6.0";

var xmlDoc = new ActiveXObject(xmlDOMDocProgID);
xmlDoc.setProperty("NewParser", true);
xmlDoc.validateOnParse = false;
xmlDoc.async = false;
xmlDoc.load(xmlFile);

var xslDoc = new ActiveXObject(xmlDOMDocProgID);
xslDoc.setProperty("NewParser", true);
xslDoc.setProperty("ProhibitDTD", false);
xslDoc.validateOnParse = false;
xslDoc.async = false;
xslDoc.load(xslFile);

WScript.StdOut.Write(xmlDoc.transformNode(xslDoc));

Assuming it is saved as test.js and the XML and XSLT files are test.xml and test.xslt respectively, the transformation can be run at the cmd prompt as follows.

C:\etc>cscript //nologo test.js test.xml test.xslt
first item&#13;
second item&#13;

C:\etc>

I think it is a bug in MSXML 6 and the "new parser" that you enable there with xslDoc.setProperty("NewParser", true). Even without using any XSLT at all, you can load a document like

<!DOCTYPE root [<!ENTITY eol "<![CDATA[&#xD;&#xA;]]>">]>
<root>&eol;</root>

with MSXML 6 and the "new parser" and check the text property of the root/document element:

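var xmlDOMDocProgID = "MSXML2.DOMDocument.6.0";

var xmlDoc = new ActiveXObject(xmlDOMDocProgID);
xmlDoc.setProperty("NewParser", true);
xmlDoc.setProperty("ProhibitDTD", false); // allow the DOCTYPE with the entity declaration
xmlDoc.validateOnParse = false;
xmlDoc.async = false; // load synchronously so documentElement is available right after load()
xmlDoc.load('cdata-input2.xml');

WScript.Echo(xmlDoc.documentElement.text);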

and it shows &#13;.

If you also output WScript.Echo(xmlDoc.documentElement.firstChild.firstChild.nodeValue); you get the same value. So somehow the entity parsing ends up "converting" the declaration <!ENTITY eol "<![CDATA[&#xD;&#xA;]]>"> from the DTD subset, together with the reference &eol;, into an entity reference node containing a CDATA section node whose node value holds the escaped decimal character reference &#13; in place of the original hexadecimal one &#xD;.
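To confirm that the node value really holds the five literal characters &#13; rather than an actual carriage return, it can help to dump the character codes instead of echoing the string. The following is only a sketch: charCodes is a hypothetical helper that is not part of the original post, and the file name cdata-input2.xml is reused from the snippet above. The MSXML part is guarded so that it runs only under Windows Script Host.

```javascript
// Hypothetical helper (an assumption, not from the original post):
// lists the UTF-16 code units of a string as uppercase hex, space-separated.
function charCodes(s) {
    var parts = [];
    for (var i = 0; i < s.length; i++) {
        var hex = s.charCodeAt(i).toString(16).toUpperCase();
        // Pad single-digit values so that CR prints as "0D" and LF as "0A".
        parts.push(hex.length < 2 ? "0" + hex : hex);
    }
    return parts.join(" ");
}

// Runs only under Windows Script Host, where WScript and ActiveXObject exist.
if (typeof WScript !== "undefined") {
    var xmlDoc = new ActiveXObject("MSXML2.DOMDocument.6.0");
    xmlDoc.setProperty("NewParser", true);
    xmlDoc.setProperty("ProhibitDTD", false);
    xmlDoc.validateOnParse = false;
    xmlDoc.async = false;
    xmlDoc.load("cdata-input2.xml");
    WScript.Echo(charCodes(xmlDoc.documentElement.text));
}
```

With this helper, a genuine CR-LF pair is reported as 0D 0A, whereas the literal text &#13; followed by a line feed is reported as 26 23 31 33 3B 0A, which makes the two cases easy to tell apart on the console.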
