XML into a data structure in Java using SAX, StAX or DOM

I've been working on this project for the past two weeks without making any headway. My issue isn't with parsing the XML file, but with what to do with it afterwards. I've written programs with SAX, StAX and DOM parsers that take a very large XML file and print out the elements and their values in order. Because the file is so large, loading all of it with DOM is inefficient. Another problem is that the file has 40,000 entries and its structure is complicated. Here is a short excerpt:

<metabolite>
  <version>3.5</version>
  <creation_date>2005-11-16 08:48:42 -0700</creation_date>
  <update_date>2013-02-08 17:07:44 -0700</update_date>
  <accession>HMDB00002</accession>
  <secondary_accessions>
  </secondary_accessions>
  <name>1,3-Diaminopropane</name>
  <description>1,3-Diaminopropane is a stable, flammable and highly hydroscopic fluid. It is a polyamine that is normally quite toxic if swallowed, inhaled or absorbed through the skin. It is a catabolic byproduct of spermidine. It is also a precursor in the enzymatic synthesis of beta-alanine. 1, 3-Diaminopropane is involved in the arginine/proline metabolic pathways and the beta-alanine metabolic pathway.</description>
  <synonyms>
    <synonym>1,3-Diamino-N-propane</synonym>
    <synonym>1,3-Propanediamine</synonym>
    <synonym>1,3-Propylenediamine</synonym>
    <synonym>1,3-Trimethylenediamine</synonym>
    <synonym>3-Aminopropylamine</synonym>
    <synonym>a,w-Propanediamine</synonym>
    <synonym>Propane-1,3-diamine</synonym>
    <synonym>Trimethylenediamine</synonym>
  </synonyms>
  <chemical_formula>C3H10N2</chemical_formula>

This is one of the 40,000 entries, and each entry contains many more elements than shown here. What I need my program to do is let the user select which information they want from the 40,000 entries and then return it as an Excel sheet. So if I only wanted, say, the version number and name for all 40,000 entries, it would write just those values to Excel. Currently my program loops through the file using StAX and prints all the elements and values to the console. How would I go about creating a data structure, such as a tree, that would let me traverse the data and return only the values I'm looking for?

This is what I've done so far in terms of looping through my document and returning the information in order for the 40,000 entries:

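import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class xmlRead {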

    private static XMLStreamReader reader;

    public xmlRead(){

        try{

            InputStream file = new FileInputStream("/Users/Kevlar/Dropbox/PhD/Java/HMDB/testOutput.xml");
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();

            reader = inputFactory.createXMLStreamReader(file);

            assert(reader.getEventType() == XMLEvent.START_DOCUMENT);   

        }   catch (XMLStreamException e){
            System.err.println("XMLStreamException : " + e.getMessage());

        }   catch (FactoryConfigurationError e){
            System.err.println("FactoryConfigurationError : " + e.getMessage());

        }   catch (FileNotFoundException e){
            System.err.println("FileNotFoundException : " + e.getMessage());

        }
    }

    public void metaboliteInfo() throws XMLStreamException{

        while(reader.hasNext()){

            int event = reader.getEventType();

            // Compare strings with equals(); == only checks object identity
            if(event == XMLStreamConstants.START_ELEMENT && "metabolite".equals(reader.getLocalName())){

                System.out.println("New " + reader.getLocalName());
                mainElements(reader);
            }

            else if(event == XMLStreamConstants.END_DOCUMENT){
                System.out.println("end of document");
                break;
            }

            else{
                reader.next();
            }

        }

        reader.close();
    }


    public void mainElements(XMLStreamReader reader) throws XMLStreamException{

            int level = 1;

            do{

                int event = reader.next();

                if(event == XMLStreamConstants.START_ELEMENT){

                    System.out.println("Element :" + reader.getLocalName());
                    level++;

                    if(level == 2){
                        subElements(reader);
                        level--;
                    }
                }

                else if(event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()){
                    System.out.println(reader.getText());
                }

                else if(event == XMLStreamConstants.END_ELEMENT){
                    level--;
                }

            }while(level > 0);

        // Do not close the reader here: metaboliteInfo() closes it once, after the whole document is read.

    }

    private void subElements(XMLStreamReader reader) throws XMLStreamException {

        int level = 1;

        do{

            int event = reader.next();

            if(event == XMLStreamConstants.START_ELEMENT){

                System.out.println("Sub element :" + reader.getLocalName());
                level++;

                if(level == 2){
                    subElements(reader);
                    level--;
                }
            }

            else if(event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()){
                System.out.println(reader.getText());
            }

            else if(event == XMLStreamConstants.END_ELEMENT){
                level--;
            }

        }while(level > 0);

    // Do not close the reader here: metaboliteInfo() closes it once, after the whole document is read.

}

    public void findElements(XMLStreamReader reader, String element) throws XMLStreamException{

            int level = 1;

            do{

                int event = reader.next();

                if(event == XMLStreamConstants.START_ELEMENT){

                    // Compare strings with equals(); == only checks object identity
                    if(element.equals(reader.getLocalName())){
                        System.out.println(reader.getLocalName());
                    }
                    level++;

                    if(level == 2){
                        subElements(reader);
                        level--;
                    }
                }

                else if(event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()){
                    System.out.println(reader.getText());
                }

                else if(event == XMLStreamConstants.END_ELEMENT){
                    level--;
                }

            }while(level > 0);

        // Do not close the reader here: the caller should close it once parsing is finished.

    }


    public static void main(String[] args) throws XMLStreamException{

        xmlRead test = new xmlRead();
        test.metaboliteInfo();

    }

}

I should probably also note that I'm not actually a programmer. I have to deal with these XML files for my research and don't have anyone else to do it for me, so my knowledge of Java is limited (i.e. explanations in layman's terms would be great).

Look up JAXB. It is a framework for converting XML to Java objects and vice versa. If you use JAXB to auto-generate your Java classes, you don't need to hand-roll your own data structure.

You'll need to start with an XML schema, which defines what your XML file is allowed to look like. If you don't already have one, you can create an XML Schema Definition (XSD) file from the XML file using a tool such as XMLSpy. JAXB provides a tool called xjc, which generates Java classes automatically from an XML schema. Where your XML has repeating tags, the generated Java classes contain collections that can be iterated over.
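
To make that concrete, here is a minimal sketch of the kind of class you end up with. It is written by hand rather than generated by xjc, and it carries no JAXB annotations so that it runs on a plain JDK. The class name MetaboliteSketch, the Metabolite fields and the readAll helper are assumptions made up for illustration, not xjc output. It fills the objects with StAX and only looks at version, name and synonym.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class MetaboliteSketch {

    /** Hypothetical stand-in for a generated class: one object per metabolite entry. */
    public static class Metabolite {
        public String version;
        public String name;
        // Repeating <synonym> tags become a collection
        public final List<String> synonyms = new ArrayList<>();
    }

    /** Reads every metabolite element from the reader into a list of objects. */
    public static List<Metabolite> readAll(XMLStreamReader reader) throws XMLStreamException {
        List<Metabolite> result = new ArrayList<>();
        Metabolite current = null; // entry being filled, or null when outside a metabolite
        int depth = 0;             // nesting depth below the current metabolite element

        while (reader.hasNext()) {
            int event = reader.next();

            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = reader.getLocalName();
                if (current == null) {
                    if ("metabolite".equals(tag)) {
                        current = new Metabolite();
                        depth = 0;
                    }
                } else {
                    depth++;
                    // getElementText() consumes the text and the matching end tag,
                    // so the depth is decremented right after each call.
                    if (depth == 1 && "version".equals(tag)) {
                        current.version = reader.getElementText();
                        depth--;
                    } else if (depth == 1 && "name".equals(tag)) {
                        current.name = reader.getElementText();
                        depth--;
                    } else if (depth == 2 && "synonym".equals(tag)) {
                        current.synonyms.add(reader.getElementText());
                        depth--;
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT && current != null) {
                if (depth == 0) {
                    // End of the metabolite element itself
                    result.add(current);
                    current = null;
                } else {
                    depth--;
                }
            }
        }
        return result;
    }
}
```

Once the list is built you can loop over it and pick whichever fields the user asked for, for example printing m.version and m.name for each Metabolite m. Keeping only a few fields per entry should need far less memory than a full DOM tree.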

XQuery solution. You can filter the input XML document with this expression:

declare function local:rewrite($node as node()) as node()?
{
    typeswitch ($node)
    case element() return
        if (matches(local-name($node), "(version|name|synonym)")) then
            element {node-name($node)}
            {
                $node/@*,
                for $child in $node/node() return local:rewrite($child)
            }
        else
            ()
    default return
        $node
};

for $m in //metabolite
return <metabolite>{for $c in $m/node() return local:rewrite($c)}</metabolite>

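Replace (version|name|synonym) with a regular expression that matches the names of the XML elements you want to keep. Note that matches() succeeds when the pattern matches any part of the name, so name also matches other element names that contain "name". An element that does not match is dropped together with its children, so synonym is kept here only because its parent synonyms also matches. Java 7 code that evaluates the XQuery expression: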

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import net.sf.saxon.Configuration;
import net.sf.saxon.om.DocumentInfo;
import net.sf.saxon.query.DynamicQueryContext;
import net.sf.saxon.query.StaticQueryContext;
import net.sf.saxon.query.XQueryExpression;
import org.xml.sax.InputSource;
// inside a method
Configuration config = new Configuration();
StaticQueryContext sqc = config.newStaticQueryContext();
DynamicQueryContext dqc = new DynamicQueryContext(config);
String xq = "XQUERY_EXPRESSION"; // placeholder: put the XQuery text shown above here
try (InputStream xmlFileInput = new FileInputStream("data.xml");
        OutputStream xmlFileOutput = new FileOutputStream("data-filtered.xml")) {
    XQueryExpression expression = sqc.compileQuery(xq);
    SAXSource source = new SAXSource(new InputSource(xmlFileInput));
    DocumentInfo di = config.buildDocument(source);
    dqc.setContextItem(di);
    expression.run(dqc, new StreamResult(xmlFileOutput), null);
} catch (Exception e) {
    System.err.println(e.getMessage());
}

The Saxon library (e.g. saxon9he.jar) must be on the classpath to compile and run this code.
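
If the goal is just a spreadsheet, a streaming alternative (plain StAX rather than XQuery, and not part of the answer above) is to filter by element name and write CSV, which Excel can open. This is only a sketch under assumptions: the class name MetaboliteCsvSketch and the writeCsv helper are made up for illustration, the selected elements are assumed to contain text only, and names are matched at any depth inside a metabolite.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class MetaboliteCsvSketch {

    /**
     * Writes one CSV row per metabolite, with one column per requested field.
     * Fields must name text-only elements; getElementText() throws otherwise.
     */
    public static void writeCsv(XMLStreamReader reader, List<String> fields, Writer out)
            throws XMLStreamException, IOException {
        // Header row
        out.write(String.join(",", fields));
        out.write("\n");

        Map<String, List<String>> row = null; // values of the current entry, null when outside one

        while (reader.hasNext()) {
            int event = reader.next();

            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = reader.getLocalName();
                if ("metabolite".equals(tag)) {
                    row = new HashMap<>();
                } else if (row != null && fields.contains(tag)) {
                    // Repeated elements (e.g. synonym) accumulate in the same column
                    row.computeIfAbsent(tag, k -> new ArrayList<>()).add(reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && row != null && "metabolite".equals(reader.getLocalName())) {
                List<String> cells = new ArrayList<>();
                for (String field : fields) {
                    List<String> values = row.getOrDefault(field, Collections.<String>emptyList());
                    cells.add(quote(String.join("; ", values)));
                }
                out.write(String.join(",", cells));
                out.write("\n");
                row = null; // only one entry is held in memory at a time
            }
        }
    }

    /** Quotes a CSV cell so commas inside values such as "1,3-Diaminopropane" stay in one column. */
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }
}
```

Calling writeCsv with the fields version and name and a FileWriter for, say, metabolites.csv would give one line per entry. Because each row is written as soon as its metabolite ends, memory use should not grow with the number of entries.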

