
Xml parsing get data between tags along with parent info

I am trying to write a generic XML parser that walks every tag and stores each tag name and its value in a map as a key-value pair. Sample XML:

<?xml version="1.0"?>
<company>
    <staff>
        <firstname>Kevin</firstname>
        <lastname>Gay</lastname>
        <salary>50000</salary>
    </staff>
</company>

The output is as follows:

NodeName:[company] Value:[

        Kevin
        Gay
        50000

]
NodeName:[staff] Value:[
    Kevin
    Gay
    50000
]
NodeName:[firstname] Value:[Kevin]
NodeName:[lastname] Value:[Gay]
NodeName:[salary] Value:[50000]

My code is as follows:

    final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    final DocumentBuilder db = dbf.newDocumentBuilder();
    final ByteArrayInputStream bis = new ByteArrayInputStream(xmlString.getBytes());
    //where xmlString is a file read using DataInputStream.
    final Document doc1 = db.parse(bis);
    printElements(doc1);

void printElements(final Document doc)
{
    final NodeList nl = doc.getElementsByTagName("*");
    Node node;

    for (int i = 0; i < nl.getLength(); i++)
    {
        node = nl.item(i);
        System.out.println("NodeName:[" + node.getNodeName() + "] Value:[" + node.getTextContent() + "]");           
    }
}

How can I keep the staff and company elements from being printed? I do not want to use JAXB or look elements up by tag name, because the XML tags will change every time. I am writing a generic XML parser whose job is to read each tag and its value and put them into a map.

Also, how can I find the parent of the tag I am parsing, so that I can keep track of where the child came from? In this scenario that would be company->staff->firstname.

You can do it with the following change, which prints an element only when it has exactly one child node (its text):

    for (int i=0; i<nodeList.getLength(); i++) 
    {
        // Get element
        Element element = (Element)nodeList.item(i);
        final NodeList nodes = element.getChildNodes();
        if(nodes.getLength() == 1)
        {               
            System.out.println(element.getNodeName() + " " + element.getTextContent());
        }            
    }

JAXB would be a better tool to use, but you can try something simple like this:

for (int i = 0; i < nl.getLength(); i++)
{
    node = nl.item(i);

    // skip the container elements; names are case-sensitive, so they
    // must match the XML exactly ("staff" and "company")
    if(node.getNodeName().equals("staff") || node.getNodeName().equals("company"))
    {
        //do stuff or dont do anything...
    }
    else//print other stuff
    {
        System.out.println("NodeName:[" + node.getNodeName() + "] Value:[" + node.getTextContent() + "]");
    }           
}

As for your second question, I'd recommend looking at the Node API:

http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/Node.html

Hint: getParentNode()
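
To make the hint concrete, here is a minimal sketch of walking getParentNode() upward to build the company->staff->firstname trail from the question. The class name ParentPath and the helpers pathOf and parse are made up for illustration; only the DOM calls come from the standard API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class ParentPath {
    // Hypothetical helper: follows getParentNode() until it leaves the
    // element tree (the root's parent is the Document node) and joins
    // the element names with "->".
    static String pathOf(Node node) {
        StringBuilder path = new StringBuilder(node.getNodeName());
        for (Node p = node.getParentNode();
             p != null && p.getNodeType() == Node.ELEMENT_NODE;
             p = p.getParentNode()) {
            path.insert(0, p.getNodeName() + "->");
        }
        return path.toString();
    }

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<company><staff><firstname>Kevin</firstname></staff></company>");
        Node first = doc.getElementsByTagName("firstname").item(0);
        System.out.println(pathOf(first)); // company->staff->firstname
    }
}
```

The same pathOf call could be used as the map key inside the question's loop if plain tag names are not unique enough.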

If you want only the deepest elements (firstname, lastname, salary), start at the root node and call node.getChildNodes() to get its children. Search each child recursively until you reach an element that has no child elements of its own. That is a leaf node, and it is the one you want to print.
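
A rough sketch of that recursive search, assuming a leaf means "an element with no child elements" (whitespace text nodes are ignored). The class LeafWalker and its methods are illustrative names, not part of any library, and repeated tag names would overwrite each other in this simple map.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class LeafWalker {
    // Recursively visits child elements; an element with no child
    // elements is treated as a leaf and stored as name -> trimmed text.
    static void collectLeaves(Node node, Map<String, String> out) {
        boolean hasElementChild = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collectLeaves(child, out);
            }
        }
        if (!hasElementChild && node.getNodeType() == Node.ELEMENT_NODE) {
            out.put(node.getNodeName(), node.getTextContent().trim());
        }
    }

    // Hypothetical convenience wrapper: parse a string, return its leaves
    // in document order.
    static Map<String, String> leaves(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> out = new LinkedHashMap<>();
            collectLeaves(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company>\n  <staff>\n    <firstname>Kevin</firstname>\n"
                   + "    <lastname>Gay</lastname>\n    <salary>50000</salary>\n"
                   + "  </staff>\n</company>";
        System.out.println(leaves(xml)); // {firstname=Kevin, lastname=Gay, salary=50000}
    }
}
```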

You could use a SAX parser to parse the XML and write your own handler to extend the DefaultHandler.

Keep track of the tags you've read in a Stack, and store the characters you read when characters() is called. When endElement() is called, the tag on top of the stack is the tag name, and the text collected by characters() is the value of that tag. The Strings below it in the stack are the parent tags leading up to this tag. For example:

For a main method reading an XML file:

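import java.io.File;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public static void main(String[] args) throws Exception {
    File xmlFile = new File("somefile.xml");

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    MyHandler handler = new MyHandler();

    // the parser calls back into the handler while it reads the file
    saxParser.parse(xmlFile, handler);

    Map<String, String> map = handler.getDataMap();
}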

Here is the handler:

import java.util.HashMap;
import java.util.Map;
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MyHandler extends DefaultHandler {
    // text collected for the element currently being read
    private final StringBuilder characters = new StringBuilder();
    private final Stack<String> tagStack;

    private final Map<String, String> dataMap;

    public MyHandler() {
        this.tagStack = new Stack<String>();
        this.dataMap = new HashMap<String, String>();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
             Attributes attributes) throws SAXException {
        this.tagStack.push(qName);
        // start collecting text afresh for the new element
        this.characters.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length)
             throws SAXException {
        // ch is the parser's whole buffer, so only use [start, start + length);
        // the parser may also deliver one element's text in several calls
        this.characters.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName,
            String qName) throws SAXException {
        // the end tag should match the most recently opened tag;
        // a mismatch means the XML is not well-formed
        if (this.tagStack.isEmpty() || !qName.equals(this.tagStack.peek())) {
            throw new SAXException("XML is not well-formed");
        }

        // trimming to take out whitespace between tags; container elements
        // such as company and staff end up empty and are skipped
        String value = this.characters.toString().trim();
        if (!value.isEmpty()) {
            // String.join needs Java 8+; on older JDKs use StringUtils.join
            // from Apache commons-lang, or write your own code to concat
            // the list with '.'s
            String tagHierarchy = String.join(".", this.tagStack);
            this.dataMap.put(tagHierarchy, value);
        }

        this.tagStack.pop();
        this.characters.setLength(0);
    }

    public Map<String, String> getDataMap() {
        return this.dataMap;
    }

}

For the input data in the question, this returns a Map with the following entries:

["company.staff.firstname", "Kevin"]
["company.staff.lastname", "Gay"]
["company.staff.salary", "50000"]

If you don't want the full path to the element as the key, you can tweak this. For example, use a Map where the key is the tag name, value[0] is the parent path, and value[1] is the actual value.
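
One possible shape for that tweak is sketched below, under the assumption that tag names are unique (a repeated name would overwrite the earlier entry). ParentAwareHandler and its parse helper are hypothetical names; it uses an ArrayDeque instead of Stack and String.join from Java 8.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParentAwareHandler extends DefaultHandler {
    private final Deque<String> tags = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    // tag name -> { parent path, value }
    private final Map<String, String[]> data = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        tags.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // drop the current tag first so only its ancestors remain
        tags.removeLast();
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            data.put(qName, new String[] { String.join(".", tags), value });
        }
        text.setLength(0);
    }

    // Hypothetical convenience wrapper: parse an XML string with this handler.
    static Map<String, String[]> parse(String xml) {
        try {
            ParentAwareHandler handler = new ParentAwareHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.data;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> map = parse(
            "<company><staff><firstname>Kevin</firstname></staff></company>");
        String[] entry = map.get("firstname");
        System.out.println(entry[0] + " / " + entry[1]); // company.staff / Kevin
    }
}
```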
