Remove all html tags except allowed tags using XSLT function

Question

I am trying to clean up some data that we get from RSS feeds using XSLT. I want to remove all tags except the p tag.

 Cows are kool.<p>The <i>milk</i> <b>costs</b> $1.99.</p>

I have a few questions about how to solve this using XSLT, in either 1.0 or 2.0.

1) I have seen this example: https://maulikdhorajia.blogspot.in/2011/06/removing-html-tags-using-xslt.html

But I need the p tags to be kept, and for that I think I need a regex. Can we use a string-before-match function and do it in a similar way? I don't think such a function exists in XPath.

2) I understand that the replace function cannot be used for this because it expects a string: if we pass it a node, the node's text content is extracted first and then passed to the function, which defeats the purpose of removing tags.

I was a little confused because replace was used in this answer: https://stackoverflow.com/a/18528749/745018

3) I am doing this in an nginx server using XSLT.

Below is a sample of the input that we get in the body tag of the RSS feed.

<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>

Update: I am also looking for an XSLT function for this.

Answer 1

Assuming you can use XSLT 2.0 then you could apply David Carlisle's HTML parser ( https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl ) to the contents of body elements and then process the resulting nodes in a mode that strips every element but p elements:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:d="data:,dpc"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="d xhtml">

    <xsl:import href="htmlparse-by-dcarlisle.xsl"/>

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="d:htmlparse(., '', true())" mode="strip"/>
        </xsl:copy>
    </xsl:template>

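    <!-- Drop every element except p, but keep processing its children in the strip mode
         so that nested elements such as <h2><i>..</i></h2> are stripped as well -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>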

</xsl:transform>

For the input

<rss>
    <entry>
        <body><![CDATA[<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>]]></body>
    </entry>
</rss>

that gives

<rss>
    <entry>
        <body><p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on March 31 because the judge ignored an earlier court order summoning him.Justice Karnan had to appear</p></body>
    </entry>
</rss>
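
As the question's update asks for a function, the same logic could be wrapped in an xsl:function. This is only a sketch: the name f:strip-tags and the namespace urn:example:functions are made up for illustration, and it assumes the same imported parser file as above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:f="urn:example:functions"
    exclude-result-prefixes="xs d f">

    <xsl:import href="htmlparse-by-dcarlisle.xsl"/>

    <!-- Hypothetical helper: parses escaped markup and returns the
         resulting nodes with every element except p removed -->
    <xsl:function name="f:strip-tags" as="node()*">
        <xsl:param name="markup" as="xs:string"/>
        <xsl:apply-templates select="d:htmlparse($markup, '', true())" mode="strip"/>
    </xsl:function>

    <!-- Identity copy, used both by default and inside the strip mode -->
    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <!-- Call the function on the escaped contents of body -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:sequence select="f:strip-tags(.)"/>
        </xsl:copy>
    </xsl:template>

    <!-- Unwrap everything that is not a p, staying in the strip mode -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>

</xsl:transform>
```

The function can then be called from any XPath expression in the stylesheet, for instance f:strip-tags(description) for a differently named element.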

If the markup is not escaped but is contained as real XML elements in the input, then you don't need to parse it and can just apply the mode to the contents:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

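    <!-- Drop every element except p, but keep processing its children in the strip mode -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>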

</xsl:transform>

http://xsltransform.net/gWEamMc/1
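
The question mentions nginx, whose XSLT module is built on libxslt and therefore only supports XSLT 1.0. The mode lists (#default strip) and #current used above are XSLT 2.0 features, so here is a rough XSLT 1.0 sketch of the same idea. It assumes the markup inside body is real XML, not escaped text; the HTML parser used in the first stylesheet needs XSLT 2.0 and is not an option here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <!-- Identity template for everything outside body -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Switch to the strip mode for the contents of body -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- In strip mode, p is the only element that is copied -->
    <xsl:template match="p" mode="strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- Any other element is dropped, but its contents are still processed -->
    <xsl:template match="*" mode="strip">
        <xsl:apply-templates mode="strip"/>
    </xsl:template>

    <!-- Attributes (of p), text, comments and processing instructions are kept -->
    <xsl:template match="@*|text()|comment()|processing-instruction()" mode="strip">
        <xsl:copy/>
    </xsl:template>

</xsl:stylesheet>
```

To allow more tags, extend the match pattern of the third template, for example match="p|br".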

Answer 1: score 4, accepted, 2017-03-10 12:54:10
