How to read a big XML file in Java and split it into small XML files based on a tag?

Question

I am new to Java programming, and I need a Java program that reads a big XML file containing <row>..</row> tags. Sample input is as follows.

Input.xml

<row>
<Name>Filename1</Name>
</row>
<row>
<Name>Filename2</Name>
</row>
<row>
<Name>Filename3</Name>
</row>
<row>
<Name>Filename4</Name>
</row>
<row>
<Name>Filename5</Name>
</row>
<row>
<Name>Filename6</Name>
</row>
 .
 .

I need the output to be the first <row>..</row> as a single .xml file named filename1.xml, the second <row>..</row> as filename2.xml, and so on.

Can anyone tell me the steps to do this in a simple way with Java? It would be very useful if you could give some sample code.

Answer 1

I suggest using SAXParser and overriding the methods of the DefaultHandler class.
You can use a few booleans to keep track of which tag you are in.

DefaultHandler lets you know when you enter a particular tag through the startElement() method. You are then given the contents of the tag through the characters() method, and finally you are notified of the end of the tag through the endElement() method.

As soon as you are notified of the end of a <row>, you can take the contents you have just saved and create a file out of them.

Looking at your example, you just need a couple of boolean values, boolean inRow and boolean inName, so this should not be a hard task =)

Example from Mkyong (I am leaving out the actual code; you must do it on your own. It is fairly trivial):

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXMLFile {

   public static void main(String argv[]) {

    try {

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    DefaultHandler handler = new DefaultHandler() {

    boolean bfname = false;
    boolean blname = false;
    boolean bnname = false;
    boolean bsalary = false;

    public void startElement(String uri, String localName,String qName, 
                Attributes attributes) throws SAXException {

        System.out.println("Start Element :" + qName);

        if (qName.equalsIgnoreCase("FIRSTNAME")) {
            bfname = true;
        }

        if (qName.equalsIgnoreCase("LASTNAME")) {
            blname = true;
        }

        if (qName.equalsIgnoreCase("NICKNAME")) {
            bnname = true;
        }

        if (qName.equalsIgnoreCase("SALARY")) {
            bsalary = true;
        }

    }

    public void endElement(String uri, String localName,
        String qName) throws SAXException {

        System.out.println("End Element :" + qName);

    }

    public void characters(char ch[], int start, int length) throws SAXException {

        if (bfname) {
            System.out.println("First Name : " + new String(ch, start, length));
            bfname = false;
        }

        if (blname) {
            System.out.println("Last Name : " + new String(ch, start, length));
            blname = false;
        }

        if (bnname) {
            System.out.println("Nick Name : " + new String(ch, start, length));
            bnname = false;
        }

        if (bsalary) {
            System.out.println("Salary : " + new String(ch, start, length));
            bsalary = false;
        }

    }

     };

       saxParser.parse("c:\\file.xml", handler);

     } catch (Exception e) {
       e.printStackTrace();
     }

   }

}
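
The handler above only prints what it sees. A minimal sketch of the approach described in this answer, adapted to the <row>/<Name> input from the question, might look like the following. It assumes the rows are wrapped in a single root element (a well-formed document needs one), that rows are not nested, and that each <Name> value is safe to use as a file name. The class name SaxRowSplitter and its split() helper are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: writes every <row> element to "<Name text>.xml".
public class SaxRowSplitter extends DefaultHandler {

    private final Path outDir;
    private boolean inRow = false;   // true between <row> and </row>
    private boolean inName = false;  // true between <Name> and </Name>
    private final StringBuilder rowXml = new StringBuilder(); // markup of the current row
    private final StringBuilder name = new StringBuilder();   // text of the current <Name>

    public SaxRowSplitter(Path outDir) {
        this.outDir = outDir;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("row")) {
            inRow = true;
            rowXml.setLength(0);
            name.setLength(0);
        }
        if (!inRow) {
            return; // ignore everything outside a row, such as the root element
        }
        rowXml.append('<').append(qName);
        for (int i = 0; i < attrs.getLength(); i++) {
            rowXml.append(' ').append(attrs.getQName(i))
                  .append("=\"").append(escape(attrs.getValue(i))).append('"');
        }
        rowXml.append('>');
        if (qName.equals("Name")) {
            inName = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!inRow) {
            return;
        }
        // characters() may be called several times for one text node, so append.
        String text = new String(ch, start, length);
        rowXml.append(escape(text));
        if (inName) {
            name.append(text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!inRow) {
            return;
        }
        rowXml.append("</").append(qName).append('>');
        if (qName.equals("Name")) {
            inName = false;
        }
        if (qName.equals("row")) {
            inRow = false;
            Path target = outDir.resolve(name.toString().trim() + ".xml");
            try {
                Files.write(target, rowXml.toString().getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    // The parser hands over decoded text, so special characters are escaped again.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void split(InputStream in, Path outDir) throws Exception {
        Files.createDirectories(outDir);
        SAXParserFactory.newInstance().newSAXParser().parse(in, new SaxRowSplitter(outDir));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: SaxRowSplitter <input.xml> <output-dir>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            split(in, Paths.get(args[1]));
        }
    }
}
```

Because only one row is buffered at a time, memory use stays small regardless of the input size. Comments and processing instructions inside a row are dropped by this sketch.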

Answer 2

You could do the following with StAX, since you said your XML is large.

Code for Your Use Case

The following code uses StAX APIs to break up the document as outlined in your question:

    import java.io.*;
    import java.util.*;

    import javax.xml.namespace.QName;
    import javax.xml.stream.*;
    import javax.xml.stream.events.*;

    public class Demo {

        public static void main(String[] args) throws Exception  {
            Demo demo = new Demo();
            demo.split("src/forum7408938/input.xml", "nickname");
            //demo.split("src/forum7408938/input.xml", null);
        }

        private void split(String xmlResource, String condition) throws Exception {
            XMLEventFactory xef = XMLEventFactory.newFactory();
            XMLInputFactory xif = XMLInputFactory.newInstance();
            XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
            StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to statements element
            StartDocument startDocument = xef.createStartDocument();
            EndDocument endDocument = xef.createEndDocument();

            XMLOutputFactory xof = XMLOutputFactory.newFactory();
            while(xer.hasNext() && !xer.peek().isEndDocument()) {
                boolean metCondition;
                XMLEvent xmlEvent = xer.nextTag();
                if(!xmlEvent.isStartElement()) {
                    break;
                }
            // Be able to split the XML file into n parts with x split elements (in
            // the dummy XML example, staff is the split element).
            StartElement breakStartElement = xmlEvent.asStartElement();
            List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();

            // BOUNTY CRITERIA
            // I'd like to be able to specify condition that must be in the 
            // split element i.e. I want only staff which have nickname, I want 
            // to discard those without nicknames. But be able to also split 
            // without conditions while running split without conditions.
            if(null == condition) {
                cachedXMLEvents.add(breakStartElement);
                metCondition = true;
            } else {
                cachedXMLEvents.add(breakStartElement);
                xmlEvent = xer.nextEvent();
                metCondition = false;
                while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
                    cachedXMLEvents.add(xmlEvent);
                    if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
                        metCondition = true;
                        break;
                    }
                    xmlEvent = xer.nextEvent();
                }
            }

            if(metCondition) {
                // Create a file for the fragment, the name is derived from the value of the id attribute
                FileWriter fileWriter = null;
                fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");

                // A StAX XMLEventWriter will be used to write the XML fragment
                XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
                xew.add(startDocument);

                // BOUNTY CRITERIA
                // The content of the split files should be wrapped in the 
                // root element from the original file (like company in the
                // dummy example)
                xew.add(rootStartElement);

                // Write the XMLEvents that were cached while we were
                // checking the fragment to see if it matched our criteria.
                for(XMLEvent cachedEvent : cachedXMLEvents) {
                    xew.add(cachedEvent);
                }

                // Write the XMLEvents that we still need to parse from this
                // fragment
                xmlEvent = xer.nextEvent();
                while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
                    xew.add(xmlEvent);
                    xmlEvent = xer.nextEvent();
                }
                xew.add(xmlEvent);

                // Close everything we opened
                xew.add(xef.createEndElement(rootStartElement.getName(), null));
                xew.add(endDocument);
                xew.close(); // flush the event writer before closing the underlying file
                fileWriter.close();
            }
        }
    }

}
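
The code above names each file after an id attribute, which the <row> elements in this question do not have. A minimal sketch of the same event-based idea for the <row>/<Name> input might look like this. It assumes the rows sit inside a single root element, are not nested, and that each <Name> value is usable as a file name. The class name StaxRowSplitter and its split() method are hypothetical, and unlike the code above the fragments are not wrapped in the original root element.

```java
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: buffers the events of one <row> at a time and
// writes them to "<Name text>.xml" once the row is complete.
public class StaxRowSplitter {

    // Local name of a start or end element, or null for any other event.
    private static String tagOf(XMLEvent event) {
        if (event.isStartElement()) {
            return event.asStartElement().getName().getLocalPart();
        }
        if (event.isEndElement()) {
            return event.asEndElement().getName().getLocalPart();
        }
        return null;
    }

    public static int split(Reader in, Path outDir) throws Exception {
        Files.createDirectories(outDir);
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();

        List<XMLEvent> rowEvents = null;          // non-null only while inside a <row>
        StringBuilder name = new StringBuilder(); // text of the current <Name>
        boolean inName = false;
        int written = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            String tag = tagOf(event);

            if (event.isStartElement() && "row".equals(tag)) {
                rowEvents = new ArrayList<>();
                name.setLength(0);
            }
            if (rowEvents == null) {
                continue; // outside any <row>: root element, whitespace, etc.
            }
            rowEvents.add(event);

            if ("Name".equals(tag)) {
                inName = event.isStartElement(); // true on <Name>, false on </Name>
            } else if (inName && event.isCharacters()) {
                name.append(event.asCharacters().getData());
            } else if (event.isEndElement() && "row".equals(tag)) {
                Path target = outDir.resolve(name.toString().trim() + ".xml");
                try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                    XMLEventWriter writer = outputFactory.createXMLEventWriter(out);
                    for (XMLEvent buffered : rowEvents) {
                        writer.add(buffered);
                    }
                    writer.close(); // flush before the file itself is closed
                }
                rowEvents = null;
                written++;
            }
        }
        reader.close();
        return written;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: StaxRowSplitter <input.xml> <output-dir>");
            return;
        }
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]))) {
            System.out.println(split(in, Paths.get(args[1])) + " files written");
        }
    }
}
```

Only the events of the current row are held in memory, so the approach should scale to large inputs as long as individual rows stay small.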

Answer 3

The best approach is to use the JAXB Marshaller and Unmarshaller to read and create the XML files.

Here is an example.

Answer 4

Assuming that your file has a root element that contains those rows:

<root>
    <row><Name>Filename1</Name></row>
    <row><Name>Filename2</Name></row>
    <row><Name>Filename3</Name></row>
    <row><Name>Filename4</Name></row>
    <row><Name>Filename5</Name></row>
    <row><Name>Filename6</Name></row>
</root>

This code will do the trick:

package com.example;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class Main {

    public static String readXmlFromFile(String fileName) throws Exception {
        StringBuilder stringBuilder = new StringBuilder();
        String lineSeparator = System.getProperty("line.separator");

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                stringBuilder.append(line);
                stringBuilder.append(lineSeparator);
            }
        }

        return stringBuilder.toString();
    }

    public static List<String> divideXmlByTag(String xml, String tag) throws Exception {
        List<String> list = new ArrayList<String>();
        Document document = loadXmlDocument(xml);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        NodeList rowList = document.getElementsByTagName(tag);
        for(int i=0; i<rowList.getLength(); i++) {
            Node rowNode = rowList.item(i);
            if (rowNode.getNodeType() == Node.ELEMENT_NODE) {
                DOMSource source = new DOMSource(rowNode);
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                StreamResult streamResult = new StreamResult(baos);
                transformer.transform(source, streamResult);
                list.add(baos.toString());
            }
        }
        return list;
    }

    private static Document loadXmlDocument(String xml) throws SAXException, IOException, ParserConfigurationException {
        return loadXmlDocument(new ByteArrayInputStream(xml.getBytes()));
    }

    private static Document loadXmlDocument(InputStream inputStream) throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);
        DocumentBuilder documentBuilder = null;
        documentBuilder = documentBuilderFactory.newDocumentBuilder();
        Document document = documentBuilder.parse(inputStream);
        inputStream.close();
        return document;
    }

    public static void main(String[] args) throws Exception {
        String xmlString = readXmlFromFile("d:/test.xml");
        System.out.println("original xml:\n" + xmlString + "\n");
        System.out.println("divided xml:\n");
        List<String> dividedXmls = divideXmlByTag(xmlString, "row");
        for (String xmlPart : dividedXmls) {
            System.out.println(xmlPart + "\n");
        }

    }
}

You only need to write these XML parts to separate files.
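
A minimal sketch of that last step, assuming the document has already been parsed into a DOM Document and that every <row> has a <Name> child whose text is usable as a file name. The class RowFileWriter and its writeRows() method are hypothetical names, not part of the code above. Like the code above, it loads the whole document into memory, so it suits moderately sized files.

```java
import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: serializes each <row> element to "<Name text>.xml".
public class RowFileWriter {

    public static int writeRows(Document document, Path outDir) throws Exception {
        Files.createDirectories(outDir);

        // An identity transformer copies a DOM node to the output unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        NodeList rows = document.getElementsByTagName("row");
        int written = 0;
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            NodeList names = row.getElementsByTagName("Name");
            if (names.getLength() == 0) {
                continue; // no <Name>, so there is nothing to name the file after
            }
            String fileName = names.item(0).getTextContent().trim() + ".xml";
            try (OutputStream out = Files.newOutputStream(outDir.resolve(fileName))) {
                transformer.transform(new DOMSource(row), new StreamResult(out));
            }
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: RowFileWriter <input.xml> <output-dir>");
            return;
        }
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args[0]));
        System.out.println(writeRows(document, Paths.get(args[1])) + " files written");
    }
}
```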

Answer 5

Since the user requested one more solution, I am posting another way.

Use a StAX parser for this situation. It prevents the entire document from being read into memory at one time.

1. Advance the XMLStreamReader to the local root element of the sub-fragment.
2. You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
3. Repeat step 1 for the next fragment.

Code Example

For the following XML, output each "statement" section into a file named after the value of its "account" attribute:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
    }

}

Answer 6

If you're new to Java then the people recommending SAX and StAX parsing are throwing you in at the deep end! This is pretty low-level stuff, highly efficient, but not designed for beginners. You said the file is "big" and they've all assumed that to mean "very big", but in my experience an unquantified "big" can mean anything from 1 MB to 20 GB, so designing a solution based on that description is somewhat premature.

It's much easier to do this with XSLT 2.0 than with Java. All it takes is a stylesheet like this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="row">
  <xsl:result-document href="{Name}.xml">
    <xsl:copy-of select="."/>
  </xsl:result-document>
</xsl:template>
</xsl:stylesheet>

And if it has to be within a Java application, you can easily invoke the transformation from Java through the JAXP API (javax.xml.transform), provided an XSLT 2.0 processor such as Saxon is on the classpath.

Answer 7

Try this:

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import javax.xml.transform.*; 
import javax.xml.transform.dom.DOMSource; 
import javax.xml.transform.stream.StreamResult;

public class Test{
 static public void main(String[] arg) throws Exception{

 DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 DocumentBuilder builder = factory.newDocumentBuilder();
 Document doc = builder.parse("foo.xml");

 TransformerFactory tranFactory = TransformerFactory.newInstance(); 
 Transformer aTransformer = tranFactory.newTransformer(); 


 // getDocumentElement() returns the root element even if a comment precedes it
 NodeList list = doc.getDocumentElement().getChildNodes();

 for (int i=0; i<list.getLength(); i++){
    Node element = list.item(i).cloneNode(true);

 if(element.hasChildNodes()){
   Source src = new DOMSource(element); 
   FileOutputStream fs=new FileOutputStream("k" + i + ".xml");
   Result dest = new StreamResult(fs);
   aTransformer.transform(src, dest);
   fs.close();
   }
   }

  }
}


7 answers

solution1
3 2013-11-22 11:15:50

solution2
3 2013-11-22 11:18:11

solution3
1 2013-11-22 11:10:43

solution4
1 2013-11-22 11:53:55

solution5
1 2013-11-22 12:07:05

solution6
1 2013-11-22 17:27:00

solution7
0 2013-11-22 11:22:37
