
Huge XML file to text files

I have a huge XML file (15 GB). I want to extract the contents of each 'text' tag in the XML file into its own file (one page per file).

Sample XML file:

<root>
    <page>
        <id> 1 </id>
        <text>
        .... 1000 to 50000 lines of text
        </text>
    </page>
    ... likewise, 2 million `page` tags
</root>

I initially used a DOM parser, but it throws a Java OutOfMemoryError (understandably). Now I've written Java code using StAX. It works correctly, but performance is really slow.

This is the code I've written:

import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.events.*;

XMLEventReader xmlEventReader = XMLInputFactory.newInstance()
        .createXMLEventReader(new FileInputStream(filePath));
boolean isText = false;
String pageContent = "";

while (xmlEventReader.hasNext()) {
    XMLEvent xmlEvent = xmlEventReader.nextEvent();

    switch (xmlEvent.getEventType()) {
    case XMLStreamConstants.START_ELEMENT:
        // Element names must be compared with equals(), not ==
        String element = ((StartElement) xmlEvent).getName().getLocalPart();
        if (element.equals("text"))
            isText = true;
        break;
    case XMLStreamConstants.CHARACTERS:
        Characters chars = (Characters) xmlEvent;
        if (!(chars.isWhiteSpace() || chars.isIgnorableWhiteSpace()) && isText)
            pageContent += chars.getData() + '\n';
        break;
    case XMLStreamConstants.END_ELEMENT:
        String elementEnd = ((EndElement) xmlEvent).getName().getLocalPart();
        if (elementEnd.equals("text")) {
            createFile(id, pageContent); // id is read from the <id> element (not shown)
            pageContent = "";
            isText = false;
        }
        break;
    }
}

This code works (ignore any minor errors). As I understand it, an XMLStreamConstants.CHARACTERS event fires for every line of the text tag, so if a text tag contains 10,000 lines, XMLStreamConstants.CHARACTERS fires 10,000 times. Is there any better way to improve the performance?

I can see a few possible things that might help you out:

  1. Use a BufferedInputStream rather than a plain FileInputStream to reduce the number of disk operations (see the sketch after this list).
  2. Consider using a StringBuilder to build pageContent rather than String concatenation.
  3. Increase your Java heap (the -Xmx option) in case you're memory-bound with a file this large.
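
A minimal sketch of (1) and (2) together, slotted into your existing setup; the initial capacity is just a guess:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;

// Buffer reads so the parser pulls from memory instead of issuing many small disk reads
XMLEventReader xmlEventReader = XMLInputFactory.newInstance()
        .createXMLEventReader(new BufferedInputStream(new FileInputStream(filePath)));

// Appending to a StringBuilder is amortized O(1); String += recopies everything each time
StringBuilder pageContent = new StringBuilder(1 << 20); // initial capacity is a guess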

It can be quite interesting in cases like this to hook up a code profiler (e.g. Java VisualVM), as you can then see exactly which method calls are slow within your code. You can then focus optimisations appropriately.

If parsing the XML file is the main issue, consider using VTD-XML, specifically the extended version, as it supports files up to 256 GB.

As it is based on non-extractive document parsing, it is quite memory-efficient, and querying/extracting text with XPath is also very fast. You can read more details about this approach and VTD-XML here.
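
I haven't run this against a file of your size, but the basic query pattern in the standard edition looks roughly like this; for 15 GB you would need the extended edition's equivalents (e.g. VTDGenHuge), and exception handling is omitted:

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

VTDGen vg = new VTDGen();
if (vg.parseFile(filePath, false)) {       // false = namespace-unaware
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    ap.selectXPath("/root/page/text");     // path taken from the sample XML above
    while (ap.evalXPath() != -1) {         // cursor lands on each matching <text>
        int t = vn.getText();              // index of the element's text token, -1 if none
        if (t != -1) {
            String pageContent = vn.toRawString(t);
            // write pageContent to its own file here
        }
    }
}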

Try parsing with a SAX parser: DOM tries to parse the entire content and place it in memory, which is why you got the memory exception. A SAX parser does not parse the entire content in one stretch.
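
A minimal SAX sketch using the element names from your sample (exception handling omitted; the output step is whatever your createFile does):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

DefaultHandler handler = new DefaultHandler() {
    private boolean isText = false;
    private final StringBuilder pageContent = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("text")) isText = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called many times per element; just keep appending
        if (isText) pageContent.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("text")) {
            // createFile(id, pageContent.toString());  // same output step as the StAX version
            pageContent.setLength(0);
            isText = false;
        }
    }
};
SAXParserFactory.newInstance().newSAXParser().parse(new java.io.File(filePath), handler);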

What is pageContent? It appears to be a String. One easy optimization to make right away would be to use a StringBuilder instead; it can append strings without making a completely new copy the way String += does (you can also construct it with an initial reserved capacity to reduce memory reallocations and copies if you have an idea of the length to begin with).

Concatenating Strings is a slow operation because strings are immutable in Java: each time you call a += b it must allocate a new string, copy a into it, then copy b onto the end of it, making each concatenation O(n) in the total length of the two strings. The same goes for appending single characters. StringBuilder, on the other hand, has the same amortized performance characteristics as an ArrayList when appending. So where you have:

pageContent += chars.getData() + '\n';

Instead change pageContent to a StringBuilder and do:

pageContent.append(chars.getData()).append('\n');

Also, if you have a guess at the upper bound of the length of one of these strings, you can pass it to the StringBuilder constructor to allocate the initial capacity and reduce the chance of a memory reallocation and full copy having to be done.
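
For example, using the figures from the question (the per-line estimate is a guess):

// ~50,000 lines at a guessed ~80 chars/line; tune to your actual data
StringBuilder pageContent = new StringBuilder(50_000 * 80);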

Another option, by the way, is to skip the StringBuilder altogether and write your data directly to your output file (presuming you're not processing the data somehow first). If you do this, and performance is I/O-bound, choosing an output file on a different physical disk can help.

Your code looks standard. However, could you try wrapping your FileInputStream in a BufferedInputStream and let us know if that helps? BufferedInputStream saves you a few native calls to the OS, so there is a chance of better performance. You have to play around with the buffer size to get optimum performance; set the size depending on your JVM memory allocation.
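
For instance (the 64 KB figure is only a starting point; measure and adjust):

// The second constructor argument is the buffer size in bytes
BufferedInputStream in = new BufferedInputStream(new FileInputStream(filePath), 64 * 1024);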

  1. Use a BufferedInputStream around the FileInputStream.
  2. Don't concatenate the data. It's a complete waste of time and space, potentially a lot of space. Write it out immediately as you get it, using a BufferedWriter around a FileWriter (see the sketch below).
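
A sketch of how the event loop from the question changes (the file naming scheme is made up; IOException handling omitted):

import java.io.BufferedWriter;
import java.io.FileWriter;

BufferedWriter out = null;
// ... inside the event loop:
case XMLStreamConstants.START_ELEMENT:
    if (element.equals("text")) {
        out = new BufferedWriter(new FileWriter("page-" + id + ".txt")); // hypothetical naming
        isText = true;
    }
    break;
case XMLStreamConstants.CHARACTERS:
    Characters chars = (Characters) xmlEvent;
    if (isText && !chars.isWhiteSpace())
        out.write(chars.getData());       // straight to disk, nothing accumulated in memory
    break;
case XMLStreamConstants.END_ELEMENT:
    if (elementEnd.equals("text")) {
        out.close();                      // flushes the buffer
        isText = false;
    }
    break;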
