Sorting a 100MB XML file with Java?

How long does sorting a 100MB XML file with Java take?

The file has items with the following structure, and I need to sort them by event:

<doc>
    <id>84141123</id>
    <title>kk+ at Hippie Camp</title>
    <description>photo by SFP</description>
    <time>18945840</time>
    <tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
    <geo></geo>
    <event>47409</event>
</doc>

I'm on an Intel Dual Duo Core with 4GB RAM.

Minutes? Hours?

Thanks

Here are the timings for a similar task executed using Saxon XQuery on a 100MB input file.

Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816

So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.

I'd say minutes. You should be able to do this entirely in memory, so reading with a SAX parser, sorting, and writing should not be a problem on your hardware.
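A minimal sketch of that in-memory approach, using only the JDK: a SAX handler collects every doc into a list, the list is sorted by event, and the result is written back out. The class name `SaxSortSketch`, the root element name `docs`, and the assumptions that every doc has flat child elements and a numeric event are mine, not from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSortSketch {

    /** Collects each <doc> as an ordered map of child element name to text. */
    static class DocHandler extends DefaultHandler {
        final List<Map<String, String>> docs = new ArrayList<>();
        private Map<String, String> current;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (qName.equals("doc")) {
                current = new LinkedHashMap<>();
            }
            text.setLength(0); // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("doc")) {
                docs.add(current);
                current = null;
            } else if (current != null) {
                current.put(qName, text.toString());
            }
        }
    }

    static void sortByEvent(InputSource in, Writer out) throws Exception {
        // Read: parse the whole input into memory.
        DocHandler handler = new DocHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);

        // Sort: numeric order on <event>; assumes every doc has one.
        handler.docs.sort(Comparator.comparingLong(d -> Long.parseLong(d.get("event").trim())));

        // Write: the root name "docs" is a placeholder for the real root element.
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("docs");
        for (Map<String, String> doc : handler.docs) {
            w.writeStartElement("doc");
            for (Map.Entry<String, String> field : doc.entrySet()) {
                w.writeStartElement(field.getKey());
                w.writeCharacters(field.getValue());
                w.writeEndElement();
            }
            w.writeEndElement();
        }
        w.writeEndElement();
        w.flush();
        w.close();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs>"
                + "<doc><id>1</id><event>47409</event></doc>"
                + "<doc><id>2</id><event>120</event></doc>"
                + "</docs>";
        StringWriter out = new StringWriter();
        sortByEvent(new InputSource(new StringReader(xml)), out);
        System.out.println(out); // the doc with event 120 now comes first
    }
}
```

For the real file you would presumably pass an `InputSource` over a buffered file stream and a buffered file writer instead of the in-memory strings used in `main`.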

I think a problem like this would be better sorted using serialisation.

  1. Deserialise the XML file into an ArrayList of 'doc'.

  2. Using plain Java code, sort on the event field and store the sorted ArrayList in another variable.

  3. Serialise out the sorted 'doc' ArrayList to file
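The three steps above could be sketched roughly as follows, using the JDK's StAX reader and writer. The `Doc` class, the method names, and the root element name `docs` are hypothetical. The sketch also assumes each field holds plain text with no nested elements.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class DocListSortSketch {

    /** Plain object for one <doc>; fields mirror the elements in the question. */
    static class Doc {
        String id, title, description, time, tags, geo;
        long event;
    }

    // Step 1: deserialise the XML into an ArrayList of Doc.
    static List<Doc> read(Reader in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<Doc> docs = new ArrayList<>();
        Doc cur = null;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("doc")) {
                    cur = new Doc();
                } else if (cur != null) {
                    String text = r.getElementText(); // reads up to the matching end tag
                    switch (name) {
                        case "id": cur.id = text; break;
                        case "title": cur.title = text; break;
                        case "description": cur.description = text; break;
                        case "time": cur.time = text; break;
                        case "tags": cur.tags = text; break;
                        case "geo": cur.geo = text; break;
                        case "event": cur.event = Long.parseLong(text.trim()); break;
                        default: break; // unknown fields are dropped in this sketch
                    }
                }
            } else if (ev == XMLStreamConstants.END_ELEMENT && r.getLocalName().equals("doc")) {
                docs.add(cur);
                cur = null;
            }
        }
        r.close();
        return docs;
    }

    // Step 2: sort a copy by event, leaving the original list untouched.
    static List<Doc> sortedByEvent(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingLong(d -> d.event));
        return sorted;
    }

    // Step 3: serialise the sorted list back to XML.
    static void write(List<Doc> docs, Writer out) throws Exception {
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("docs"); // placeholder root name
        for (Doc d : docs) {
            w.writeStartElement("doc");
            field(w, "id", d.id);
            field(w, "title", d.title);
            field(w, "description", d.description);
            field(w, "time", d.time);
            field(w, "tags", d.tags);
            field(w, "geo", d.geo);
            field(w, "event", Long.toString(d.event));
            w.writeEndElement();
        }
        w.writeEndElement();
        w.flush();
        w.close();
    }

    private static void field(XMLStreamWriter w, String name, String value) throws Exception {
        w.writeStartElement(name);
        w.writeCharacters(value == null ? "" : value);
        w.writeEndElement();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs><doc><id>1</id><event>47409</event></doc>"
                + "<doc><id>2</id><event>120</event></doc></docs>";
        StringWriter out = new StringWriter();
        write(sortedByEvent(read(new StringReader(xml))), out);
        System.out.println(out);
    }
}
```

Holding plain objects rather than a full DOM tree keeps the per-record overhead small, which is in the spirit of the memory estimate given in this answer.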

If you do it in memory, you should be able to do this in under 10 seconds. You would be pushing it to do this in under 2 seconds, because it will spend about that much time reading from and writing to disk.

This program should use no more than 4-5 times the original file size in memory, about 500 MB in your case.

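import java.io.File;
import java.util.Map;
import java.util.TreeMap;
import org.apache.commons.io.FileUtils;

// Splitting on <doc> and </doc> puts the record bodies at the odd indexes.
String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");
// A TreeMap iterates in ascending key order. Keys here are <id> values; a duplicate key would overwrite the earlier record.
Map<Long, String> recordMap = new TreeMap<Long, String>();
for(int i=1;i<records.length;i+=2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1+4);
    // "<id>" is 4 characters long, so the digits start at pos1+4.
    long num = Long.parseLong(record.substring(pos1+4, pos2));
    recordMap.put(num, record);
}

// Rebuild the file: leading text, records in key order, trailing text.
StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length-1]);
FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());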
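The snippet above keys its TreeMap on the id element, while the question asks for a sort by event. Assuming several docs can share the same event value, a plain `Map<Long, String>` would silently drop all but one of them. One possible adaptation groups records per event instead. The class and method names are hypothetical, and the sketch works on strings so the file I/O stays out of the way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EventKeySketch {

    /** Reorders the <doc> records of an XML string by their numeric <event> value. */
    static String sortByEvent(String xml) {
        // Same splitting trick as above: record bodies sit at the odd indexes.
        String[] records = xml.split("</?doc>");

        // Each event maps to the list of records that carry it, in input order.
        Map<Long, List<String>> byEvent = new TreeMap<>();
        for (int i = 1; i < records.length; i += 2) {
            String record = records[i];
            int pos1 = record.indexOf("<event>");
            int pos2 = record.indexOf("</event>", pos1 + 7);
            // "<event>" is 7 characters long, so the digits start at pos1 + 7.
            long event = Long.parseLong(record.substring(pos1 + 7, pos2).trim());
            byEvent.computeIfAbsent(event, k -> new ArrayList<>()).add(record);
        }

        StringBuilder sb = new StringBuilder(records[0]);
        for (List<String> group : byEvent.values()) {
            for (String s : group) {
                sb.append("<doc>").append(s).append("</doc>");
            }
        }
        // With no <doc> at all, records[0] is the whole input and must not be appended twice.
        if (records.length > 1) {
            sb.append(records[records.length - 1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<docs><doc><id>1</id><event>7</event></doc>"
                + "<doc><id>2</id><event>3</event></doc></docs>";
        System.out.println(sortByEvent(xml));
    }
}
```

Like the original snippet, this relies on string matching rather than real XML parsing, so it assumes the literal tags `<doc>` and `<event>` never appear inside text content or carry attributes.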
