
Parsing XML tags nested within other XML values

I am stuck developing an XML parser that has to process a large chunk of XML.

My problem is that I'm confused about how to parse XML tags nested within the text value of another element. My input file looks something like this:

<main>
<step>
    <para>Calculate the values from the pool</para>
</step>
<step>
        <para>Use these(<internalRef id ="003" xlink:actuate="onRequest" xlink:show="replace" xlink:href="max003"/>) values finally</para>
</step>
</main>

I am able to get the value of the first step element using XPath. My problem is how to get the value of the second step using XPath, or rather how to identify when a new tag starts inside an element's text value.

For example, my XPath for the second step returns this result: Use these () values finally

whereas my aim is to get: Use these ( max003 ) values finally

The max003 value has to be taken from xlink:href.

Addition: I am able to get the individual values of id, actuate and show by writing separate XPath expressions. My question is how to insert the max003 value, once I have read it from xlink:href, inside the parentheses (after "these" and before "values") and then send the result across the wire for display. So I am looking for a way to identify where the sub-node starts within the text, and a mechanism to place its value inside the parentheses.

You won't be able to do that using XPath alone. What you have there is mixed content XML, meaning that an element may contain both a text value and sub-elements. You can only reference one of those at a time using XPath, and you also cannot just concat what you get from multiple XPath expressions, since the text value may surround the sub-elements as you state in your example.

I suggest you use XSLT to transform the document and then query the transformed document using XPath as you do now. Alternatively, you can write your own parser that handles the nested element properly.

This XSLT will probably work for you (I haven't tested it thoroughly):

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet 
  version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xlink="http://www.w3.org/1999/xlink">
    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="internalRef">
        <xsl:value-of select="@xlink:href"/>
    </xsl:template>
</xsl:stylesheet>

Of course, you then need to use an XSLT processor to transform your original document.
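As a rough illustration of the XSLT processor step, here is a minimal sketch using the JDK's built-in javax.xml.transform API. The class name XsltFlattenDemo and its transform helper are made up for this example, the stylesheet is the one above without the indent option, and the input assumes that the xlink prefix is declared on the root element with the standard XLink namespace URI (the XML in the question does not declare it, and a namespace-aware processor would reject it as posted).

import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies the identity-plus-internalRef stylesheet to the sample XML.
final class XsltFlattenDemo {

    // Assumption: xlink is bound to the standard XLink namespace in both documents.
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    static final String XML =
          "<main xmlns:xlink=\"" + XLINK_NS + "\">"
        + "<step><para>Calculate the values from the pool</para></step>"
        + "<step><para>Use these(<internalRef id =\"003\" xlink:actuate=\"onRequest\""
        + " xlink:show=\"replace\" xlink:href=\"max003\"/>) values finally</para></step>"
        + "</main>";

    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:xlink=\"" + XLINK_NS + "\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        // Identity template: copy everything that has no more specific template.
        + "<xsl:template match=\"@* | node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@* | node()\"/></xsl:copy>"
        + "</xsl:template>"
        // Replace each internalRef element with the value of its xlink:href attribute.
        + "<xsl:template match=\"internalRef\">"
        + "<xsl:value-of select=\"@xlink:href\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the XML and returns the serialized result.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XML, XSLT));
    }
}

In the transformed document the second para holds only text, so the XPath the question already uses for the first step should work for the second one as well.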

And a parser could look like this (note that this is just skeleton code for a StAX parser):

import java.io.StringReader;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class ExampleStAXParser {

    private static final int STATE_UNDEFINED = 1;
    private static final int STATE_MAIN = 2;
    private static final int STATE_STEP = 3;
    private static final int STATE_PARA = 4;

    private static final String EL_MAIN = "main";
    private static final String EL_STEP = "step";
    private static final String EL_PARA = "para";
    private static final String EL_INTERNAL_REF = "internalRef";
    private static final String ATT_HREF = "href";

    private int state = STATE_UNDEFINED;
    private String characters;

    public void parse(String xmlString) throws XMLStreamException, Exception {


        XMLEventReader reader = null;
        try {
            if (xmlString == null || xmlString.isEmpty()) {
                throw new IllegalArgumentException("Illegal initialization (xmlString is null or empty)");
            }
            StringReader stringReader = new StringReader(xmlString);
            XMLInputFactory inputFact = XMLInputFactory.newInstance();
            XMLStreamReader streamReader = inputFact.createXMLStreamReader(stringReader);
            reader = inputFact.createXMLEventReader(streamReader);

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();

                if (event.isCharacters()) {
                    characters(event);
                }
                if (event.isStartElement()) {
                    startElement(event);
                    // handle attributes
                    Iterator<Attribute> attributes = event.asStartElement().getAttributes();
                    while(attributes.hasNext()) {
                        attribute(attributes.next());
                    }
                }
                if (event.isEndElement()) {
                    endElement(event);
                }
                if (event.isStartDocument()) {
                    startDocument(event);
                }
                if (event.isEndDocument()) {
                    endDocument(event);
                }

            }            
        } catch (XMLStreamException ex) {
            throw ex;
        } finally {
            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (XMLStreamException ex) {
                // ignore: nothing useful can be done if closing the reader fails
            }
        }
    }

    private void attribute(XMLEvent event) throws Exception {
        if (state == STATE_PARA) {
            Attribute attr = (Attribute) event;
            String name = attr.getName().getLocalPart();
            if (ATT_HREF.equals(name)) {
                if (characters == null) {
                    characters = attr.getValue();
                } else {
                     characters += attr.getValue();
                }
            }
        } else
            throw new Exception("unexpected attribute");
    }

    private void characters(XMLEvent event) throws Exception {
        Characters asCharacters = event.asCharacters();
        if (asCharacters.isWhiteSpace())
            return;
        if (state == STATE_PARA) {            
            if (characters == null) {
                characters = asCharacters.getData();
            } else {
                 characters += asCharacters.getData();
            }
        } else
            throw new Exception("unexpected characters");
    }

    private void startElement(XMLEvent event) throws Exception {
        StartElement startElement = event.asStartElement();
        String name = startElement.getName().getLocalPart();
        switch (state) {
            case STATE_UNDEFINED:
                if (name.equals(EL_MAIN)) {
                    state = STATE_MAIN;
                    System.out.println("Element: " + name);
                } else
                    throw new Exception("unexpected element");
                break;
            case STATE_MAIN:
                if (name.equals(EL_STEP)) {
                    state = STATE_STEP;
                    System.out.println("Element: " + name);
                } else
                    throw new Exception("unexpected element");
                break;
            case STATE_STEP:
                if (name.equals(EL_PARA)) {
                    state = STATE_PARA;
                    System.out.println("Element: " + name);
                } else
                    throw new Exception("unexpected element");
                break;
            case STATE_PARA:
                if (name.equals(EL_INTERNAL_REF)) {
                    System.out.println("Element: " + name);
                } else
                    throw new Exception("unexpected element");
                break;
            default:
                throw new Exception("unexpected element");
        }
    }

    private void endElement(XMLEvent event) throws Exception {
        EndElement endElement = event.asEndElement();
        String name = endElement.getName().getLocalPart();
        switch (state) {
            case STATE_MAIN:
                if (name.equals(EL_MAIN)) {
                    state = STATE_UNDEFINED;
                } else
                    throw new Exception("unexpected element");
                break;
            case STATE_STEP:
                if (name.equals(EL_STEP)) {
                    state = STATE_MAIN;
                } else
                    throw new Exception("unexpected element");
                break;
            case STATE_PARA:
                if (name.equals(EL_INTERNAL_REF)) {
                    // do nothing
                } else if (name.equals(EL_PARA)) {
                    System.out.println("Value: " + String.valueOf(characters));
                    characters = null;
                    state = STATE_STEP;
                } else
                    throw new Exception("unexpected element");
                break;
            default:
                throw new Exception("unexpected element");
        }
    }

    private void startDocument(XMLEvent event) {
        System.out.println("Parsing started");
    }

    private void endDocument(XMLEvent event) {
        System.out.println("Parsing ended");
    }

    public static void main(String[] argv) throws XMLStreamException, Exception {
        String xml = "";
        xml += "<main>";
        xml += "<step>";
        xml += "    <para>Calculate the values from the pool</para>";
        xml += "</step>";
        xml += "<step>";
        xml += "        <para>Use these(<internalRef id =\"003\" actuate=\"onRequest\" show=\"replace\" href=\"max003\"/>) values finally</para>";
        xml += "</step>";
        xml += "</main>";

        ExampleStAXParser parser = new ExampleStAXParser();
        parser.parse(xml);
    }
}

Evaluating this XPath expression:

 concat(/*/step[2]/para/text()[1],
        /*/step[2]/para/internalRef/@xlink:href,
        /*/step[2]/para/text()[2])

on the provided XML document (corrected to be namespace-well-formed):

<main xmlns:xlink="Undefined namespace">
    <step>
        <para>Calculate the values from the pool</para>
    </step>
    <step>
        <para>Use these(<internalRef id ="003" xlink:actuate="onRequest" xlink:show="replace" xlink:href="max003"/>) values finally</para>
    </step>
</main>

produces the wanted result:

Use these(max003) values finally

Note: you will need to register the xlink namespace with your XPath API in order for this XPath expression to be evaluated without an error.

XSLT-based verification:
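For the Java XPath API (javax.xml.xpath), registering the namespace could look roughly like the sketch below. The class name XlinkXPathDemo and the evaluate helper are invented for illustration, and the xlink prefix is assumed to be bound to the standard XLink namespace URI; whatever URI you use must match the one declared in your document.

import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: evaluates the concat() expression with the xlink prefix registered.
final class XlinkXPathDemo {

    // Assumption: the document binds xlink to the standard XLink namespace.
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    static final String XML =
          "<main xmlns:xlink=\"" + XLINK_NS + "\">"
        + "<step><para>Calculate the values from the pool</para></step>"
        + "<step><para>Use these(<internalRef id =\"003\" xlink:actuate=\"onRequest\""
        + " xlink:show=\"replace\" xlink:href=\"max003\"/>) values finally</para></step>"
        + "</main>";

    static final String EXPRESSION =
          "concat(/*/step[2]/para/text()[1],"
        + " /*/step[2]/para/internalRef/@xlink:href,"
        + " /*/step[2]/para/text()[2])";

    // Parses the XML namespace-aware and evaluates the expression to a string.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed names in XPath
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Map the prefix used in the expression to the namespace URI of the document.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "xlink".equals(prefix) ? XLINK_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XLINK_NS.equals(uri) ? "xlink" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    if (XLINK_NS.equals(uri)) {
                        return Collections.singletonList("xlink").iterator();
                    }
                    return Collections.<String>emptyList().iterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(XML, EXPRESSION));
    }
}

The prefix in the expression does not have to be the same as the prefix in the document; only the namespace URI has to match.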

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xlink="Undefined namespace">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select=
     "concat(/*/step[2]/para/text()[1],
           /*/step[2]/para/internalRef/@xlink:href,
           /*/step[2]/para/text()[2])
     "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document (above), the XPath expression is evaluated and the result of the evaluation is copied to the output:

Use these(max003) values finally

As near as I can tell, your parser sees your structure as something like this:

step
 +- para
     +-id

It is then joining the "text" content together and leaving out that id node...

(This is pure speculation.)

UPDATE

If I simply walk the node tree (listing each child), this is what I get:

 main
  step
    para
      #text - Calculate the values from the pool
  step
    para
      #text - Use these(
      id
      #text - ) values finally

This means that "id" is a child of "para", sitting between the two text nodes.
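Because the nested element is just another child between two text nodes, one way to build the wanted string is to walk the children of each para in order, appending text nodes as they are and substituting the nested element with its attribute value. The sketch below is only an illustration: the class name MixedContentWalkDemo and the flattenParas helper are made up, and it parses with the default (not namespace-aware) DocumentBuilderFactory on the assumption that this lets the XML from the question be read as posted, with xlink:href treated as a plain attribute name.

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: flattens mixed content by walking the children of each para.
final class MixedContentWalkDemo {

    static final String XML =
          "<main>"
        + "<step><para>Calculate the values from the pool</para></step>"
        + "<step><para>Use these(<internalRef id =\"003\" xlink:actuate=\"onRequest\""
        + " xlink:show=\"replace\" xlink:href=\"max003\"/>) values finally</para></step>"
        + "</main>";

    // Returns one string per para element, with each internalRef replaced by its xlink:href value.
    static List<String> flattenParas(String xml) {
        try {
            // Default factory settings: not namespace-aware, so "xlink:href" is an ordinary name.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paras = doc.getElementsByTagName("para");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < paras.getLength(); i++) {
                StringBuilder text = new StringBuilder();
                // Visit the children in document order so the text keeps its original position.
                for (Node child = paras.item(i).getFirstChild();
                        child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        text.append(child.getNodeValue());
                    } else if (child.getNodeType() == Node.ELEMENT_NODE
                            && "internalRef".equals(child.getNodeName())) {
                        text.append(((Element) child).getAttribute("xlink:href"));
                    }
                }
                result.add(text.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        for (String para : flattenParas(XML)) {
            System.out.println(para);
        }
    }
}

Unlike the concat() expression with text()[1] and text()[2], this loop does not depend on how many nested references a para contains.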
