
Find elements by class using Xpath

I am trying to filter and reformat HTML by separating out the elements of a specific class. My HTML input is as follows:

<body style="background-color:#FFFFFF;margin:0px;padding:0px">
<div class="pdf_page" id="pdf_page1" style="width:707px;height:1024px">
<span class="pdf_text pdf_text0" style="top:50px;left:688px">1</span>
<span class="pdf_text pdf_text1" style="top:119px;left:96px">Healthcare 
Hospitals</span>
<span class="pdf_text pdf_text4" style="top:190px;left:96px">PUBLIC 
HOSPITALS/MEDICAL CLINICS</span>
<span class="pdf_text pdf_text5" style="top:207px;left:96px">Alexandra 
Hospital</span>
<span class="pdf_text pdf_text5" style="top:224px;left:96px">Admiralty 
Medical Centre</span>
<span class="pdf_text pdf_text5" style="top:241px;left:96px">Changi General 
Hospital</span>
<span class="pdf_text pdf_text4" style="top:460px;left:96px">PRIVATE 
HOSPITALS/MEDICAL CLINICS</span>
<span class="pdf_text pdf_text5" style="top:477px;left:96px">Farrer Park 
Hospital</span>
<span class="pdf_text pdf_text5" style="top:494px;left:96px">Fortis Surgical 
Hospital</span>
<span class="pdf_text pdf_text5" style="top:511px;left:96px">Gleneagles 
Hospital</span>
<span class="pdf_text pdf_text4" style="top:662px;left:96px">DAY SURGERY 
CENTRES</span>
<span class="pdf_text pdf_text5" style="top:679px;left:96px">A Clinic For 
Women</span>
<span class="pdf_text pdf_text5" style="top:696px;left:96px">A Company For 
Women</span>
</div>
...

I wrote the snippet below to separate out all span elements with class 'pdf_text pdf_text4':

<xsl:template match="/">
  <vce>
    <xsl:apply-templates select="body" />
   </vce>
</xsl:template>
<xsl:template match="div">
  <document>
    <content name="header">
      <xsl:value-of select="(//span[contains(@class, 'pdf_text pdf_text4')])" />
    </content>
    <content name="data">
      <xsl:value-of select="." />
</content>
  </document>
</xsl:template>

But with this, I get the following output:

<vce>
<document>
<content name="header">PUBLIC HOSPITALS/MEDICAL CLINICS</content>
<content name="data">
1 Healthcare List of M...
</content>
</document>
<document>
<content name="header">PUBLIC HOSPITALS/MEDICAL CLINICS</content>
<content name="data">
1 Healthcare List of M...
</content>
</document>

As you can see above, "PUBLIC HOSPITALS/MEDICAL CLINICS" repeats in every document instead of moving on to the next span with the matching class.

What am I doing wrong?

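A path that starts with // is absolute: it searches from the document root, so every div gets the first matching span in the whole document. Use a path relative to the current div: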

<xsl:value-of select="(descendant-or-self::span[contains(@class, 'pdf_text pdf_text4')])" />

instead of

<xsl:value-of select="(//span[contains(@class, 'pdf_text pdf_text4')])" />

See Transformation at http://xsltransform.net/pNvs5vM
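
In XSLT 1.0, xsl:value-of applied to a node-set outputs only the first node, so even the relative path yields one header per div. If every matching span in the div should produce its own header, a minimal sketch could use xsl:for-each instead. This is an addition rather than part of the answer above, and it assumes that body is the root element and that matching on the 'pdf_text4' token is specific enough (contains() would also match a class such as pdf_text40).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumption: body is the root element of the input -->
  <xsl:template match="/">
    <vce>
      <xsl:apply-templates select="body/div"/>
    </vce>
  </xsl:template>

  <xsl:template match="div">
    <document>
      <!-- Relative path: only spans inside the current div are considered -->
      <xsl:for-each select=".//span[contains(@class, 'pdf_text4')]">
        <content name="header">
          <!-- normalize-space collapses the line break inside each span -->
          <xsl:value-of select="normalize-space(.)"/>
        </content>
      </xsl:for-each>
      <content name="data">
        <xsl:value-of select="."/>
      </content>
    </document>
  </xsl:template>
</xsl:stylesheet>
```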

I prepared a stylesheet in XSLT 1.0, based on template recursion.

The main template (matching "/") applies the "normal" (default-mode) template only to span elements with class ...text4.

This "normal" template for span elements first processes the current element (creating the header), then starts processing the following span elements (with class ...text5) by applying another template, in cell mode, to the next sibling. Because of the recursion, processing continues as long as the next sibling has class ...text5.

The initial recursive call (from the "normal" template) is wrapped in a <content name="data"> element. For details see below.

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <vce><document>
      <xsl:apply-templates select="body/div/span[contains(@class, 'text4')]"/>
    </document></vce>
  </xsl:template>

  <xsl:template match="span">
    <!-- First process the current span (text4) -->
    <content name="header">
      <xsl:value-of select="." />
    </content>
    <!-- Then, recursively, text5, starting from the next -->
    <content name="data">
      <xsl:apply-templates select="following-sibling::*[1]" mode="cell"/>
    </content>
  </xsl:template>

  <!-- Recursive processing of text5 spans -->
  <xsl:template match="span" mode="cell">
    <!-- Process the current span -->
    <xsl:value-of select="."/>
    <!-- Find the next span (if any) -->
    <xsl:variable name="nextItem" select="following-sibling::*[1][self::span]
      [contains(@class, 'text5')]"/>
    <!-- Next span found -->
    <xsl:if test="$nextItem">
      <!-- Separator -->
      <xsl:text>, </xsl:text>
      <!-- Process the next span -->
      <xsl:apply-templates select="$nextItem" mode="cell"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
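
As an alternative to recursion, the same grouping can be sketched with xsl:key. This swaps in a different technique from the answer above: each ...text5 span is indexed by the generated id of its nearest preceding ...text4 sibling, and each header then looks up its own cells. The key name cellsByHeader is hypothetical, body is assumed to be the root element, and wrapping every group in its own document element is a choice made here, not something the original stylesheet does.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <!-- Index each text5 span by the id of the nearest preceding text4 sibling -->
  <xsl:key name="cellsByHeader"
           match="span[contains(@class, 'pdf_text5')]"
           use="generate-id(preceding-sibling::span[contains(@class, 'pdf_text4')][1])"/>

  <xsl:template match="/">
    <vce>
      <xsl:apply-templates select="body/div/span[contains(@class, 'pdf_text4')]"/>
    </vce>
  </xsl:template>

  <!-- Only text4 spans are applied here, so each one opens a new group -->
  <xsl:template match="span">
    <document>
      <content name="header">
        <xsl:value-of select="normalize-space(.)"/>
      </content>
      <content name="data">
        <!-- All text5 spans whose nearest preceding text4 is the current span -->
        <xsl:for-each select="key('cellsByHeader', generate-id())">
          <xsl:if test="position() &gt; 1">
            <xsl:text>, </xsl:text>
          </xsl:if>
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:for-each>
      </content>
    </document>
  </xsl:template>
</xsl:stylesheet>
```

With this approach a header that has no ...text5 spans after it simply gets an empty data element, because the key lookup returns an empty node-set.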
