I am trying to filter HTML and reformat it into a desired structure by separating out elements of a specific class. My HTML input is as follows:
<body style="background-color:#FFFFFF;margin:0px;padding:0px">
<div class="pdf_page" id="pdf_page1" style="width:707px;height:1024px">
<span class="pdf_text pdf_text0" style="top:50px;left:688px">1</span>
<span class="pdf_text pdf_text1" style="top:119px;left:96px">Healthcare
Hospitals</span>
<span class="pdf_text pdf_text4" style="top:190px;left:96px">PUBLIC
HOSPITALS/MEDICAL CLINICS</span>
<span class="pdf_text pdf_text5" style="top:207px;left:96px">Alexandra
Hospital</span>
<span class="pdf_text pdf_text5" style="top:224px;left:96px">Admiralty
Medical Centre</span>
<span class="pdf_text pdf_text5" style="top:241px;left:96px">Changi General
Hospital</span>
<span class="pdf_text pdf_text4" style="top:460px;left:96px">PRIVATE
HOSPITALS/MEDICAL CLINICS</span>
<span class="pdf_text pdf_text5" style="top:477px;left:96px">Farrer Park
Hospital</span>
<span class="pdf_text pdf_text5" style="top:494px;left:96px">Fortis Surgical
Hospital</span>
<span class="pdf_text pdf_text5" style="top:511px;left:96px">Gleneagles
Hospital</span>
<span class="pdf_text pdf_text4" style="top:662px;left:96px">DAY SURGERY
CENTRES</span>
<span class="pdf_text pdf_text5" style="top:679px;left:96px">A Clinic For
Women</span>
<span class="pdf_text pdf_text5" style="top:696px;left:96px">A Company For
Women</span>
</div>
...
I wrote the snippet below to format it, so that I can separate out all spans with the class 'pdf_text pdf_text4':
<xsl:template match="/">
<vce>
<xsl:apply-templates select="body" />
</vce>
</xsl:template>
<xsl:template match="div">
<document>
<content name="header">
<xsl:value-of select="(//span[contains(@class, 'pdf_text pdf_text4')])" />
</content>
<content name="data">
<xsl:value-of select="." />
</content>
</document>
</xsl:template>
But with this, I get the following output:
<vce>
<document>
<content name="header">PUBLIC HOSPITALS/MEDICAL CLINICS</content>
<content name="data">
1 Healthcare List of M...
</content>
</document>
<document>
<content name="header">PUBLIC HOSPITALS/MEDICAL CLINICS</content>
<content name="data">
1 Healthcare List of M...
</content>
</document>
As you can see above, "PUBLIC HOSPITALS/MEDICAL CLINICS" repeats in every document instead of the next span with the matching class being picked up.
What am I doing wrong?
Use
<xsl:value-of select="(descendant-or-self::span[contains(@class, 'pdf_text pdf_text4')])" />
instead of
<xsl:value-of select="(//span[contains(@class, 'pdf_text pdf_text4')])" />
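A path starting with // is evaluated from the document root, so inside every div template it selects the same spans, and xsl:value-of in XSLT 1.0 outputs only the first of them. The relative path searches only within the current div.
See the transformation at http://xsltransform.net/pNvs5vM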
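To make the difference concrete, here is a minimal XSLT 1.0 sketch of a div template that uses only relative paths. It rests on assumptions not stated in the question: that every page is a div whose class contains pdf_page, that the header spans are direct children of that div, and that all headers of a page should be listed inside one document element. The use of normalize-space() to collapse the line breaks inside the spans is also an addition.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <vce>
      <!-- Assumption: each page is a div whose class contains "pdf_page" -->
      <xsl:apply-templates select="//div[contains(@class, 'pdf_page')]"/>
    </vce>
  </xsl:template>

  <xsl:template match="div">
    <document>
      <!-- Relative path: only the header spans that are children of THIS div.
           Note that contains() would also match a class such as "pdf_text40". -->
      <xsl:for-each select="span[contains(@class, 'pdf_text4')]">
        <content name="header">
          <xsl:value-of select="normalize-space(.)"/>
        </content>
      </xsl:for-each>
      <!-- Outside the for-each, "." is the div again: all text of the page -->
      <content name="data">
        <xsl:value-of select="normalize-space(.)"/>
      </content>
    </document>
  </xsl:template>
</xsl:stylesheet>
```

Because xsl:value-of in XSLT 1.0 emits only the first node of a node-set, the sketch iterates with xsl:for-each so that every header of the page appears, not just the first one.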
I prepared a script in XSLT 1.0, based on template recursion.
The main template (matching "/") calls the "normal" (default mode) template to process only span
elements with class ...text4.
This "normal" template for span first processes the element itself (creating the header), then starts processing the following span
elements (with class ...text5) by calling another template, in cell mode, on the next sibling. Because of the recursion, this processing continues as long as the next sibling has class ...text5.
The "starting" recursive call (from the "normal" template) is wrapped in a <content name="data">
element. For details see below.
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<vce><document>
<xsl:apply-templates select="body/div/span[contains(@class, 'text4')]"/>
</document></vce>
</xsl:template>
<xsl:template match="span">
<!-- First process the current span (text4) -->
<content name="header">
<xsl:value-of select="." />
</content>
<!-- Then, recursively, text5, starting from the next -->
<content name="data">
<xsl:apply-templates select="following-sibling::*[1]" mode="cell"/>
</content>
</xsl:template>
<!-- Recursive processing of text5 spans -->
<xsl:template match="span" mode="cell">
<!-- Process the current span -->
<xsl:value-of select="."/>
<!-- Find the next span (if any) -->
<xsl:variable name="nextItem" select="following-sibling::*[1][self::span]
[contains(@class, 'text5')]"/>
<!-- Next span found -->
<xsl:if test="$nextItem">
<!-- Separator -->
<xsl:text>, </xsl:text>
<!-- Process the next span -->
<xsl:apply-templates select="$nextItem" mode="cell"/>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
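For comparison, the same grouping can probably be done without recursion by using xsl:key. This is a different technique from the one above, shown only as a sketch: the key name items-by-header is made up for illustration, and emitting one document element per header is an assumption about the desired output (the stylesheet above wraps everything in a single document element).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <!-- Hypothetical key: index every text5 span by the id of the nearest
       preceding text4 sibling. preceding-sibling is a reverse axis, so [1]
       picks the closest header before the span. -->
  <xsl:key name="items-by-header"
           match="span[contains(@class, 'pdf_text5')]"
           use="generate-id(preceding-sibling::span[contains(@class, 'pdf_text4')][1])"/>

  <xsl:template match="/">
    <vce>
      <xsl:apply-templates select="//span[contains(@class, 'pdf_text4')]"/>
    </vce>
  </xsl:template>

  <!-- Only text4 spans are ever selected, so this template sees headers only -->
  <xsl:template match="span">
    <document>
      <content name="header">
        <xsl:value-of select="normalize-space(.)"/>
      </content>
      <content name="data">
        <!-- All text5 spans grouped under this header, in document order -->
        <xsl:for-each select="key('items-by-header', generate-id())">
          <xsl:if test="position() &gt; 1">
            <xsl:text>, </xsl:text>
          </xsl:if>
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:for-each>
      </content>
    </document>
  </xsl:template>
</xsl:stylesheet>
```

With the sample input, the header PUBLIC HOSPITALS/MEDICAL CLINICS should be paired with Alexandra Hospital, Admiralty Medical Centre and Changi General Hospital. One behavioural difference: the key collects every text5 span up to the next header even if other elements sit in between, whereas the recursive version stops at the first sibling that is not a text5 span.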