
Using StAX to modify nested XML conditionally and write it back, for a huge XML file (around 6 GB)

I am a beginner in Java and have been working on code that unzips a zip file containing around 100,000 XML files and merges them into one XML file, so that I can process a single file instead of loading that many files. I unzipped the archive, merged the files into one, and parsed it with a DOM parser, but now I need to modify this merged XML file and write it back as one file. I can do this with a DOM parser and StringBuilder, but StringBuilder does not seem able to handle a file this big: it fails with a Java heap space error.

After further research, I understood that a StAX parser might be a better fit for large files, with better performance.

I have been going through multiple articles and tutorials but have not yet managed to write code that meets my requirement. My XML has multiple tags; after merging, the structure looks something like this:

<Items>
   <Item>
      <Tag1></Tag1>
      <Tag2></Tag2>
      <Images>
         <Image>
            <width>200</width>
            <height>200</height>
            <url>xyz.com</url>
            <action>update</action>
         </Image>
         <Image>
            <width>400</width>
            <height>600</height>
            <url>xyz.com</url>
            <action>update</action>
         </Image>
      </Images>
   </Item>
   <Item>
      <Tag1></Tag1>
      <Tag2></Tag2>
      <Images>
         <Image>
            <width>100</width>
            <height>400</height>
            <url>abc.com</url>
            <action>update</action>
         </Image>
         <Image>
            <width>400</width>
            <height>200</height>
            <url>xyz.com</url>
            <action>update</action>
         </Image>
      </Images>
   </Item>
</Items>

My requirement is to check whether the width and height under an Image tag are greater than some value, and only then keep that Image tag; otherwise it should be deleted from the Images section. In the same way, I will need to remove some other tags from the file and, once all the processing is done, write the whole XML file back with the changes.

I went through many articles on StAX implementations, but I am still not clear on how to access the Image tag, which is a great-grandchild of the root tag <Items>.

It's very easy using XSLT 3.0 streaming:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:template match="Image">
    <xsl:variable name="image" select="copy-of(.)"/>
    <xsl:sequence select="$image[width*height gt 100000]"/>
  </xsl:template>
</xsl:stylesheet>

Of course you could use any predicate here on width and height. The stylesheet outputs the image if and only if the predicate is true.
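If a streaming-capable XSLT 3.0 processor is not an option, roughly the same filter can be sketched with the StAX event API that the question asks about. This is only a minimal sketch under some assumptions: the class name ImageFilter and the minWidth/minHeight parameters are hypothetical, the predicate is the question's "both dimensions greater than some value" rather than the width*height test above, and width and height are assumed to be plain integers. Only the events of one Image are buffered at a time, so memory use should not grow with the file size.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper; class, method and parameter names are illustrative only.
final class ImageFilter {

    /** Streams XML from in to out, keeping an Image only if both dimensions exceed the minimums. */
    static void filter(Reader in, Writer out, int minWidth, int minHeight)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            boolean imageStart = event.isStartElement()
                    && "Image".equals(event.asStartElement().getName().getLocalPart());
            if (!imageStart) {
                writer.add(event);              // everything outside Image is copied unchanged
                continue;
            }
            List<XMLEvent> image = new ArrayList<>();
            image.add(event);
            if (bufferImage(reader, image, minWidth, minHeight)) {
                for (XMLEvent e : image) {
                    writer.add(e);              // replay the buffered Image
                }
            }                                   // otherwise the buffered Image is dropped
        }
        writer.close();
        reader.close();
    }

    /** Reads events up to and including </Image> into the buffer; returns true if it should be kept. */
    private static boolean bufferImage(XMLEventReader reader, List<XMLEvent> image,
                                       int minWidth, int minHeight) throws XMLStreamException {
        int width = 0;
        int height = 0;
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            image.add(e);
            if (e.isStartElement()) {
                text.setLength(0);              // start collecting text for this child element
            } else if (e.isCharacters()) {
                text.append(e.asCharacters().getData());
            } else if (e.isEndElement()) {
                String name = e.asEndElement().getName().getLocalPart();
                if (name.equals("width")) {
                    width = Integer.parseInt(text.toString().trim());
                } else if (name.equals("height")) {
                    height = Integer.parseInt(text.toString().trim());
                } else if (name.equals("Image")) {
                    break;
                }
            }
        }
        return width > minWidth && height > minHeight;
    }
}
```

For a file on disk, the Reader and Writer would typically come from something like Files.newBufferedReader and Files.newBufferedWriter. Other tags that need removing could be handled the same way, by matching their start element and skipping events up to the matching end element.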

However, if you've got 6 GB of XML distributed over 100K files, then I'm not sure why you're concatenating them before processing. That's going to be much more memory-intensive than processing them individually (which can also be done with a few lines of XSLT code).
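Since the question is about Java, here is a rough sketch of what processing the files individually could look like without XSLT: each XML entry is read straight from the zip, transformed on its own, and written to a new zip, so no merged 6 GB file is ever built. The class ZipPerEntry and the EntryTransform hook are hypothetical names introduced for illustration, the ".xml" suffix check is an assumption about how the entries are named, and readAllBytes requires Java 9 or later. It also assumes each individual file is small enough to hold in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; names are illustrative and not part of any library.
final class ZipPerEntry {

    /** Hypothetical hook: receives one XML document as bytes and returns the modified bytes. */
    interface EntryTransform {
        byte[] apply(byte[] xml) throws Exception;
    }

    /** Transforms every .xml entry of the input zip on its own; returns the number of entries processed. */
    static int transformEntries(InputStream zipIn, OutputStream zipOut, EntryTransform transform)
            throws Exception {
        int count = 0;
        try (ZipInputStream zin = new ZipInputStream(zipIn);
             ZipOutputStream zout = new ZipOutputStream(zipOut)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) {
                    continue;                         // skip anything that is not an XML file
                }
                byte[] original = zin.readAllBytes(); // reads only up to the end of this entry
                byte[] modified = transform.apply(original);
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(modified);
                zout.closeEntry();
                count++;
            }
        }
        return count;
    }
}
```

The transform passed in would be whatever per-file modification is needed, for example a small DOM or StAX routine that drops the unwanted Image elements; since each file is small, even a DOM-based transform should stay well within the heap.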


 