HTML input to XSLT 2.0: what to do with &nbsp?

Question

There are many questions about preserving &nbsp; in XSLT, about outputting space characters in XSLT, and about & in CDATA input. This is a different problem: I have HTML files that contain &nbsp; and I want to convert them to XML, but I cannot figure out how to read the input with the Saxon XSLT 2.0 processor. This is for a text mining application, so I have no control over the input.

Here is example text from the input:

<P STYLE="line-height:0px;margin-top:0px;margin-bottom:0px;border-bottom:0.5pt solid #000000">
&nbsp;
</P>

To start, I just want to eliminate every &nbsp; from the output. Once I can do that, I will eliminate attributes like STYLE and other HTML constructs.

The problem is that I cannot get Saxon to read the HTML file at all. I get this error:

SXXP0003: Error reported by XML parser: The entity "nbsp" was referenced, but not declared.

Here is my test XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [ <!ENTITY nbsp "&#160;"> ]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
        xmlns="http://www.w3.org/1999/html"
        version="2.0">
  <xsl:output method="xml" omit-xml-declaration="yes" encoding="utf-8"/>
  <xsl:strip-space elements="*"/>
  <!-- copy all elements and their attributes-->
  <xsl:template match="* | @*">
    <xsl:copy><xsl:copy-of select="@*"/><xsl:apply-templates/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>

I'm just learning XSLT, so there are some constructs I don't completely understand. I think the DOCTYPE declaration allows the use of &nbsp; in the XSL file, not in the input file. I tried changing the DOCTYPE declaration to

<!DOCTYPE xsl:stylesheet [ <!ENTITY html "&#160;"> ]>

That had no effect. I also removed the

xmlns="http://www.w3.org/1999/html"

from the xsl:stylesheet declaration, and it didn't fix the problem.

Clearly I am not the only person to have had this problem. I'm sure there is a simple fix; I just haven't been able to find it. It's keeping me from doing the real work, which is very frustrating. Any help would be greatly appreciated.

Answer 1

Use the saxon:parse-html() extension function (available in Saxon-PE and -EE) to read the HTML and present it as a standard XDM tree.

Alternatively, if you want to use Saxon-HE rather than -PE or -EE, supply your input as a SAXSource whose XMLReader is an HTML parser such as TagSoup or the validator.nu HTML parser.
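The SAXSource approach can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name HtmlToXml, the helper names, and the inline identity stylesheet are all made up for the example. So that it runs on a plain JDK, the demo passes the JDK's own XML parser as the XMLReader; for real HTML you would pass an HTML parser instead (with TagSoup on the classpath, that should be new org.ccil.cowan.tagsoup.Parser()). TransformerFactory.newInstance() normally picks up Saxon when its jar is on the classpath; otherwise the JDK's built-in XSLT 1.0 processor is used, which is why the demo stylesheet says version="1.0".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, named for this example only.
final class HtmlToXml {

    // Small identity stylesheet used only for the demo.
    public static final String IDENTITY_XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in reader: the JDK's XML parser. Replace with an HTML parser
    // (e.g. TagSoup's org.ccil.cowan.tagsoup.Parser) for real HTML input.
    public static XMLReader jdkXmlReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // Runs the stylesheet over whatever SAX events the given reader produces.
    public static String transform(XMLReader reader, InputSource input, String xslt)
            throws Exception {
        SAXSource source = new SAXSource(reader, input);
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        InputSource input = new InputSource(new StringReader("<p>a<b>c</b></p>"));
        System.out.println(transform(jdkXmlReader(), input, IDENTITY_XSLT));
    }
}
```

One thing to watch for when swapping in TagSoup: as far as I know it reports element names in lower case and in the XHTML namespace (http://www.w3.org/1999/xhtml), so match patterns written for <P> may need adjusting.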

Answer 2

If you want to parse HTML rather than XML, you need an HTML parser on hand, and you need to tell Saxon to use it instead of an XML parser. So download either TagSoup ( http://home.ccil.org/~cowan/tagsoup/ ) or the validator.nu HTML5 parser ( https://about.validator.nu/htmlparser/ ).
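To see why the XML parser is the piece that fails, here is a small sketch using only the JDK (the class name NbspEntityDemo and the method parseOrError are made up for the example). An XML parser only knows an entity if the input document itself declares it; a DOCTYPE in the stylesheet applies to the stylesheet alone. Declaring nbsp inside the input makes this tiny fragment parse, but that only helps when the input is otherwise well-formed XML, which real-world HTML often is not. That is the reason to use an HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, named for this example only.
final class NbspEntityDemo {

    // Parses the text as XML; returns the root element's text content,
    // or the parser's error message if the text is not well-formed.
    public static String parseOrError(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return "ok: [" + doc.getDocumentElement().getTextContent() + "]";
        } catch (SAXParseException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // No declaration in the input: the XML parser rejects &nbsp;.
        System.out.println(parseOrError("<P>&nbsp;</P>"));

        // Entity declared in the input's own internal DTD subset: accepted,
        // and the reference is expanded to U+00A0.
        System.out.println(parseOrError(
            "<!DOCTYPE P [ <!ENTITY nbsp \"&#160;\"> ]><P>&nbsp;</P>"));
    }
}
```

The first call should fail with an "entity was referenced, but not declared" style message like the one in the question (the exact wording depends on the parser and locale); the second should succeed.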
