
How to process a large number of XML files and write them to text files in Java quickly

I receive millions of XML files a day. Their sizes range from 10 KB to 50 MB.

I have written a SAX parser to parse the XML files and write the output to text files. I create 35 unique text files from all of these XML files. I have to parse the XML files in first-come, first-served order so that the order of the records is maintained.

I have to process the files very quickly.

The total size of the XML files is approximately 1 TB. I have not implemented multithreading to process the XML files because I have to process them on a first-come, first-served basis.

How can I process all the XML files quickly?

Before moving my code to production, I just wanted to check whether I need to rethink my implementation.

This is how I read the XML files and process them:

public static void main(String[] args) {
        File folder = new File("c://temp//SDIFILES");

        File[] files = folder.listFiles();

        Arrays.sort(files, new Comparator<Object>() {
            public int compare(Object o1, Object o2) {

                if (((File) o1).lastModified() > ((File) o2).lastModified()) {
                    return -1;
                } else if (((File) o1).lastModified() < ((File) o2).lastModified()) {
                    return +1;
                } else {
                    return 0;
                }
            }

        });

        for (File file : files) {
            System.out.println("Started processing file: " + file);
            new MySaxParser(file);
        }

    }

I am not sure whether my approach will scale to millions of XML files.

As you said, you have to process the files on a first-come, first-served basis. You could treat each XML file as an independent task and then use multiple threads to process the XML files. I think you can save a lot of time this way.

Immediately:

return Long.compare(((File) o1).lastModified(), ((File) o2).lastModified());
  • read and write buffered
  • be careful with String operations
  • skip validation
  • for DTDs, use XML catalogs
  • use a profiler! (it saved me once in Excel generation)
  • if possible, use a database instead of 35 output files
  • consider a RAM disk or similar
  • and of course give the JVM plenty of memory (-Xmx)

A last resort would be an XML pull parser (StAX) instead of Xalan/Xerces, or plain-text parsing; but that is what you are trying to avoid, so no comment on that.
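The "read and write buffered" point above can be sketched with `java.nio.file` — a minimal sketch, where the class and method names are mine, and the writer is reopened per call only for brevity (in the real job you would keep 35 long-lived writers, one per output file, and close them at the end):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BufferedAppend {

    // Append one record to an output file through a buffered writer.
    // CREATE + APPEND means the file is created on first use and
    // subsequent calls add to the end rather than truncating.
    static void appendLine(Path out, String line) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            w.write(line);
            w.newLine();
        }
    }
}
```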

Arrays.sort(files, new Comparator<File>() {
        @Override
        public int compare(File o1, File o2) {
            return Long.compare(o1.lastModified(), o2.lastModified());
        }
    });
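For what it's worth, on Java 8+ the same oldest-first sort can be written with a method reference instead of an anonymous class — a small sketch (the class and method names are mine):

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class FileOrder {

    // Oldest-first (first-come, first-served) sort, equivalent to the
    // comparator above but without casts or explicit compare logic.
    static void sortOldestFirst(File[] files) {
        Arrays.sort(files, Comparator.comparingLong(File::lastModified));
    }
}
```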

There are a number of things to consider...

  1. Is this a batch process, where all the files are already in the c://temp//SDIFILES folder, or is it a kind of event listener that waits for the next file to appear there?

  2. Do you have XSD schemas for all those XMLs? If so, you might consider using a JAXB unmarshaller instead of a custom SAX parser.

IMHO, at first glance...

  1. If it is a batch process - separate the parsing step from the step that combines results into text files. Then you can apply multithreading to parse the files, staging each result in a temp file or object before writing it to the text files, i.e.:

    • run as many parsing threads as your resources allow (memory/CPU)
    • put each parser's result aside in a temp file/DB/in-memory map etc., along with its order number or timestamp
    • combine the finished results into the text files as the last step of the whole process; that way you never have to wait for the previous XML file to finish before parsing the next one.
  2. If it is a listener, it can also use multithreading to parse, but a little more is needed. For example, spin up the result-combining step periodically (say, every 10 seconds) to pick up temp result files marked as ready.

In both cases it will be a "portioned" process. Say you run 5 parsing threads for the next 5 files from the timestamp-sorted list, then wait until all 5 parsing threads have completed (the result need not be a temp file; it can stay in memory if possible), then combine into the text files... then pick the next 5 files, and so on...

... something like that...

Definitely, sequentially processing that large a number of files will take time, and most of it will go into parsing the XML.
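The batch-of-5 scheme described above can be sketched with an `ExecutorService` — a minimal sketch, assuming a stand-in `parse` method in place of the real SAX parsing; it relies on `invokeAll` returning futures in the same order the tasks were submitted, so results are combined in arrival order even though parsing runs in parallel:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchProcessor {

    static final int BATCH = 5;

    // Stand-in for the real SAX parsing; here it just returns a label.
    static String parse(File f) {
        return "parsed:" + f.getName();
    }

    // Sort oldest-first, then parse in batches of BATCH threads.
    // invokeAll returns futures in submission order, so the combine
    // step below preserves first-come, first-served record order.
    static List<String> processInOrder(File[] files)
            throws InterruptedException, ExecutionException {
        Arrays.sort(files, Comparator.comparingLong(File::lastModified));
        ExecutorService pool = Executors.newFixedThreadPool(BATCH);
        List<String> results = new ArrayList<>();
        try {
            for (int i = 0; i < files.length; i += BATCH) {
                List<Callable<String>> tasks = new ArrayList<>();
                int end = Math.min(i + BATCH, files.length);
                for (File f : Arrays.asList(files).subList(i, end)) {
                    tasks.add(() -> parse(f));
                }
                for (Future<String> fut : pool.invokeAll(tasks)) {
                    results.add(fut.get()); // ordered combine step
                }
            }
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

In the real job, `parse` would run MySaxParser and the combine step would append to the 35 output files instead of collecting strings.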
