Remove all HTML tags except the allowed ones using an XSLT function

Question

I am trying to use XSLT to clean up some data I get from an RSS feed. I want to remove every tag except the p tag.

 Cows are kool.<p>The <i>milk</i> <b>costs</b> $1.99.</p>

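I have a few questions about how to solve this with XSLT 1.0 or 2.0.

1) I have seen this example: https://maulikdhorajia.blogspot.in/2011/06/removing-html-tags-using-xslt.html

But I need the p tags to stay, and that seems to require a regular expression. Could we use a string-before-match function and proceed in a similar way? I don't think such a function exists in XPath.

2) I know the replace function cannot be used for this, because it expects a string: if we pass it a node, the node's text content is extracted and passed to the function, so the tags are already gone and there is nothing left to remove selectively.

That leaves me a bit confused, because replace is used in https://stackoverflow.com/a/18528749/745018 .

3) I am running this on an nginx server that uses XSLT.

Below is a sample input: the body tag of the RSS feed we receive.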

<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>

Update: I am also looking for an XSLT function for this.

Answer 1

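Assuming you can use XSLT 2.0, you can apply David Carlisle's HTML parser ( https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl ) to the contents of the body element and then process the resulting nodes in a mode that strips every element except p elements: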

<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:d="data:,dpc"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="d xhtml">

    <xsl:import href="htmlparse-by-dcarlisle.xsl"/>

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="d:htmlparse(., '', true())" mode="strip"/>
        </xsl:copy>
    </xsl:template>

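    <!-- Any element other than p: drop the tag, keep processing its content in the same mode -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>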

</xsl:transform>

For the input

<rss>
    <entry>
        <body><![CDATA[<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>]]></body>
    </entry>
</rss>

this gives

<rss>
    <entry>
        <body><p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on March 31 because the judge ignored an earlier court order summoning him.Justice Karnan had to appear</p></body>
    </entry>
</rss>
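
Since the question also asks for a function, the parse-and-strip step could be wrapped in an xsl:function. The following is only a sketch under stated assumptions: it needs XSLT 2.0 and the same htmlparse import as above, and the namespace urn:example:functions and the name f:strip-tags are hypothetical names chosen for illustration, not part of the parser or of the stylesheet above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:f="urn:example:functions"
    exclude-result-prefixes="xs d f">

    <xsl:import href="htmlparse-by-dcarlisle.xsl"/>

    <!-- Hypothetical helper: parse an escaped HTML string and return its nodes
         with every element except p unwrapped -->
    <xsl:function name="f:strip-tags" as="node()*">
        <xsl:param name="html" as="xs:string"/>
        <xsl:apply-templates select="d:htmlparse($html, '', true())" mode="strip"/>
    </xsl:function>

    <!-- Identity template, used in the default mode and in the strip mode -->
    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <!-- Call the helper on the string value of body -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:sequence select="f:strip-tags(string(.))"/>
        </xsl:copy>
    </xsl:template>

    <!-- Any element other than p: drop the tag, keep its content -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>

</xsl:transform>
```

The function only packages the same mode-based templates, so it can be called from any XPath expression in the stylesheet, for example on a description or title element as well as on body.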

If the markup is not escaped but is part of the input as XML, there is no need to parse it; it is enough to apply the mode to the content:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

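    <!-- Any element other than p: drop the tag, keep processing its content in the same mode -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>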

</xsl:transform>

http://xsltransform.net/gWEamMc/1
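
The question mentions nginx, whose XSLT filter module is built on libxslt and therefore only supports XSLT 1.0. For the unescaped case, the same idea can be sketched in XSLT 1.0, which has no #default or #current mode keywords, so the mode is spelled out on each template. This is a minimal sketch, assuming the markup inside body is well-formed XML; it does not cover the escaped (CDATA) case, because XSLT 1.0 has no standard way to parse a string into nodes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <!-- Identity template: copy everything outside body unchanged -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Process the children of body in the strip mode -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- Allowed element: keep the tag and its attributes, keep stripping inside -->
    <xsl:template match="p" mode="strip">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- Any other element: drop the tag but keep processing its content -->
    <xsl:template match="*" mode="strip">
        <xsl:apply-templates select="node()" mode="strip"/>
    </xsl:template>

    <!-- Text nodes are copied by the built-in template rule, which applies in
         every mode; comments and processing instructions inside body are dropped -->

</xsl:stylesheet>
```

To allow more tags, extend the match pattern of the keeping template, for example match="p | br". The more specific name pattern takes precedence over the match="*" template because of the default template priorities.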

Solution 1
4 Accepted 2017-03-10 12:54:10