
Good XSLT for Python - lxml struggles

I'm trying to transform XHTML to text using a user-defined XSLT, which is the following:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xpath-default-namespace="http://www.w3.org/1999/xhtml">

<xsl:output method="text"/>

<xsl:template match="/html">
Reading document entitled <xsl:value-of select="head/title"/>.

The top menu for this site has the following options:
<xsl:for-each select="body//ul[@role='menubar']/li/a">
    <xsl:value-of select="."></xsl:value-of> <xsl:text>&#xa;</xsl:text>
</xsl:for-each>

Now let's read the main part of the page.
<xsl:for-each select="body//main[@class='container']//(h1 | h2 | h3 | h4 | p | ul/li/a)">
    <xsl:value-of select="normalize-space(.)"/><xsl:text>&#xa;</xsl:text><xsl:text>&#xa;</xsl:text>    
</xsl:for-each>

The footer menu for this site has the following options:
<xsl:for-each select="body//footer[@id='wb-info']//ul/li/a">
    <xsl:value-of select="."></xsl:value-of> <xsl:text>&#xa;</xsl:text>
</xsl:for-each>

</xsl:template>
</xsl:stylesheet>

When I test it at http://xsltransform.net/ against a typical HTML page, the output is as expected.

I test the same XSLT against the same XHTML using the following Python code:

import lxml.etree as ET

html = ET.parse("../fixed_html/about.html")
xslt = ET.parse("../templates/generic.xslt")
transform = ET.XSLT(xslt)
res = transform(html)
print(res)

I get the following error:

lxml.etree.XSLTParseError: xsl:for-each: could not compile select expression 'body//main[@class='container']//(h1 | h2 | h3 | h4 | p | ul/li/a)'

My first thought is that lxml has limitations and can't handle valid XSLT. I'm hoping that's not the case, and that I just failed to set up the code correctly.

Any issues with the Python code? Can I process the XSLT above in Python some other way?

Your stylesheet declares version="1.0" but the code itself requires an XSLT 2.0 processor:

  1. The xpath-default-namespace attribute is an XSLT 2.0 feature;
  2. In XPath 1.0, a parenthesized expression is allowed only as the first step of a path, so a later step such as //(h1 | h2 | ...) cannot be compiled.
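
The second point can be checked directly from Python, since lxml compiles XPath with the same libxml2 engine that libxslt relies on. The following is only a rough sketch: the helper name compiles and the short test expressions are made up for illustration and are not taken from the stylesheet.

```python
import lxml.etree as ET


def compiles(expr):
    """Return True if lxml (libxml2) accepts expr as an XPath 1.0 expression."""
    try:
        ET.XPath(expr)  # compilation happens when the object is created
        return True
    except ET.XPathError:
        return False


# Parentheses as the first step of a path are valid XPath 1.0.
leading = compiles("(h1 | h2)/a")
# Parentheses in a later step are XPath 2.0 syntax and should be rejected.
later = compiles("body//(h1 | h2)")
# The XPath 1.0 equivalent is a union of complete paths.
union = compiles("body//h1 | body//h2")

print(leading, later, union)
```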

lxml uses the libxslt processor, which supports only XSLT 1.0. You will need to either rewrite your stylesheet for XSLT 1.0 or incorporate an XSLT 2.0 (or higher) processor into your processing chain.
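
One possible XSLT 1.0 rewrite is sketched below, under some assumptions: the prefix x (an arbitrary choice) is bound to the XHTML namespace in place of xpath-default-namespace, and the parenthesized step is replaced by a self:: predicate joined with a union of complete paths. The stylesheet is cut down to the title and the main section, and the inline sample document is invented for the demonstration, so treat it as a starting point rather than a drop-in replacement.

```python
import lxml.etree as ET

# XSLT 1.0 sketch: every XHTML element name carries the "x" prefix, and the
# XPath 2.0 step //(h1 | h2 | ...) becomes a self:: predicate plus a union.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="/x:html">
    <xsl:text>Title: </xsl:text>
    <xsl:value-of select="x:head/x:title"/>
    <xsl:text>&#xa;</xsl:text>
    <xsl:for-each select="x:body//x:main[@class='container']//*[self::x:h1 or self::x:h2 or self::x:h3 or self::x:h4 or self::x:p] | x:body//x:main[@class='container']//x:ul/x:li/x:a">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

# Hypothetical stand-in for ../fixed_html/about.html.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Demo</title></head>
  <body>
    <main class="container">
      <h1>Heading</h1>
      <p>Some   text</p>
      <ul><li><a href="#">Link</a></li></ul>
    </main>
  </body>
</html>"""

transform = ET.XSLT(ET.fromstring(STYLESHEET))
output = str(transform(ET.fromstring(SAMPLE)))
print(output)
```

The union still visits the selected nodes in document order, so headings, paragraphs and links come out in the order they appear on the page, as in the XSLT 2.0 version.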


"When I test it at http://xsltransform.net/ against a typical HTML page, the output is as expected."

Only when you select the Saxon 9.5.1 engine. With any other processor you will get an error.

XSLT 2.0 and 3.0 are supported in Python by Saxonica's SaxonC 11.1 release, published this month; see the details at https://www.saxonica.com/download/c.xml and https://www.saxonica.com/saxon-c/documentation11/index.html#!starting .

At the current stage, you need to compile/build the Python module on your own after downloading the source code and the library modules of SaxonC 11.1.


 