XSLT: Parse string as XML node set (concretely: transform an HTML string into a node-set)?

Question

I have an XML node containing a string that represents HTML. I need to truncate this string, but that can of course produce invalid HTML markup (e.g. if I always cut the string after 30 characters, I can easily lose closing tags like </ul>). What can I do? It seems very difficult, and I haven't found any real help via Google.

My idea so far: use "analyze-string" with a regex to select the nodes and their contents, and write them out as XML element nodes. But I am having big problems handling all the cases, especially nested nodes.

Does anyone have any ideas?

FYI: My notepad:

1. Regex: catch the first tag
2. Read the tag name of the first node
3. Put the tag name into a regex and match the whole element; also capture the rest of the string (to continue with it later)
4. Check the content: more tags? Yes -> step 1, no -> step 5
5. Write the tag as an element node
6. Take the rest of the string -> step 1

Here is the XML-doc:

<?xml version="1.0" encoding="UTF-8"?>
<html>
    <data>
        <![CDATA[
        <h2>header</h2><p>A little article. <b>Here</b> it's already done!</p>
        ]]>
    </data>
</html>

What I want to do:

In <data> I have a string (HTML) and just want to output a specific number of characters (e.g. the first 25). When I do this on the raw string, I get this result:

"<h2>header</h2><p>A little article"

In the next step I put this string into the HTML output, but at this point I get invalid markup because the <p> tag is not closed.

So my first approach: parse this string to get XML nodes for each tag, then walk over each node, write an XML element (to make sure the final markup is valid) and copy characters until the limit is reached, which in this example would be 25 characters.

Answer 1

If you have an XML node whose content represents HTML, then that content should have been entity-encoded, i.e. the opening and closing angle brackets converted to &lt; and &gt;. This means you can cut it wherever you like and still have a valid XML document.
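
To illustrate the escaping point: a CDATA section and entity-encoded text give the same string value, and XSLT works on that parsed value, so substring() counts real characters and the serializer escapes the result again on output. A minimal sketch, assuming the question's document as input; the <teaser> wrapper element is a hypothetical name chosen for illustration:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="/html">
    <!-- <teaser> is a hypothetical wrapper, not part of the question -->
    <teaser>
      <!-- normalize-space() drops the whitespace around the CDATA content;
           substring() then cuts the plain string after 25 characters -->
      <xsl:value-of select="substring(normalize-space(data), 1, 25)"/>
    </teaser>
  </xsl:template>

</xsl:stylesheet>
```

The result is well-formed XML, but the cut-off HTML tags end up as literal text inside <teaser>, so this alone does not keep the HTML markup itself balanced.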

Answer 2

Since you mention analyze-string, you seem to be using XSLT 2.0. In that case you have two options. With Saxon 9 there is an extension function for parsing XML, http://www.saxonica.com/documentation/extensions/functions/parse.xml (and even one for parsing HTML, http://www.saxonica.com/documentation/extensions/functions/parse-html.xml). Alternatively, there is David Carlisle's pure XSLT 2.0 implementation of an HTML parser, http://code.google.com/p/web-xslt/source/browse/trunk/htmlparse, which you can import into your stylesheet and then apply to the contents of your data element.
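
Once the string is a node tree, the truncation described in the question can be done with a small set of templates. The following is only a sketch under stated assumptions: it uses saxon:parse (whether it is available depends on the Saxon version and edition), it assumes the HTML string is well-formed XML, and the `limit` parameter, the `truncate` mode and the wrapping <div> are hypothetical names introduced here. For real-world HTML, one of the HTML parsers mentioned above would take the place of saxon:parse.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="xs saxon">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Assumption: maximum number of text characters to keep -->
  <xsl:param name="limit" as="xs:integer" select="25"/>

  <xsl:template match="/html">
    <!-- Wrap the string in a <div> so it has a single root element,
         then parse it into a separate document (Saxon extension) -->
    <xsl:variable name="tree"
        select="saxon:parse(concat('&lt;div&gt;',
                                   normalize-space(data),
                                   '&lt;/div&gt;'))"/>
    <div>
      <xsl:apply-templates select="$tree/div/node()" mode="truncate"/>
    </div>
  </xsl:template>

  <!-- Copy an element only if the limit was not reached before it starts;
       xsl:copy always writes the matching end tag -->
  <xsl:template match="*" mode="truncate">
    <xsl:variable name="seen"
        select="sum(preceding::text()/string-length(.))"/>
    <xsl:if test="$seen lt $limit">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates select="node()" mode="truncate"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Output only as many characters as the remaining budget allows;
       a zero or negative length yields the empty string -->
  <xsl:template match="text()" mode="truncate">
    <xsl:variable name="seen"
        select="sum(preceding::text()/string-length(.))"/>
    <xsl:value-of select="substring(., 1, $limit - $seen)"/>
  </xsl:template>

</xsl:stylesheet>
```

Because every element is written with xsl:copy, the output stays balanced wherever the character limit falls. With the question's document and a limit of 25, this keeps "header", "A little article. " and the single character "H" inside <b>. Recounting the preceding text for every node is quadratic in the worst case, which should be acceptable for short snippets; the count is also only correct because the parsed string lives in its own document, so the preceding axis does not see text from the source file.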

2 answers

solution1
0 2012-01-03 09:50:18

solution2
0 2012-01-03 11:10:52
