
Sorting a 100MB XML file with Java?

How long does sorting a 100MB XML file with Java take?

The file has items with the following structure, and I need to sort them by event:

<doc>
    <id>84141123</id>
    <title>kk+ at Hippie Camp</title>
    <description>photo by SFP</description>
    <time>18945840</time>
    <tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
    <geo></geo>
    <event>47409</event>
</doc>

I'm on an Intel Dual Duo Core with 4GB RAM.

Minutes? Hours?

Thanks

Here are the timings for a similar task executed using Saxon XQuery on a 100MB input file.

Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816

So: about 6 seconds to parse the input file and build a tree, and 3.5 seconds to sort it. That's invoked from the command line, but invoking it from Java will give very similar performance. Don't try to code the sort yourself: it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
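For the structure in the question, the equivalent query would presumably be something like `for $d in //doc order by number($d/event) return $d`. Saxon is not part of the JDK, so as a rough illustration of the same declarative idea, here is a sketch that swaps in the XSLT 1.0 processor bundled with the JDK instead of XQuery. The class name `XsltEventSort` and the stylesheet are assumptions made for illustration, and nothing here reproduces the Saxon timings above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltEventSort {
    // Stylesheet (an assumption, not from the thread): copy the root element
    // and emit its <doc> children ordered by the numeric value of <event>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/*'>"
        + "<xsl:copy>"
        + "<xsl:for-each select='doc'>"
        + "<xsl:sort select='event' data-type='number'/>"
        + "<xsl:copy-of select='.'/>"
        + "</xsl:for-each>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the sorted XML.
    static String sortByEvent(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs>"
                + "<doc><id>1</id><event>30</event></doc>"
                + "<doc><id>2</id><event>10</event></doc>"
                + "</docs>";
        System.out.println(sortByEvent(xml));
    }
}
```

For a real file, `new StreamSource(new File(...))` and `new StreamResult(new File(...))` can replace the string-based source and result.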

I'd say minutes. You should be able to do this entirely in memory, so reading with a SAX parser, sorting, and writing should not be a problem on your hardware.
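A minimal sketch of that read, sort, write pipeline, using the JDK's SAX parser for reading and `XMLStreamWriter` for writing. It assumes every `<doc>` is flat (child elements containing only text, as in the question's sample) and that every `<doc>` has a numeric `<event>`. The names `SaxEventSort` and `DocCollector` are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventSort {
    // SAX handler that turns each flat <doc> into an ordered field-name -> text map.
    static class DocCollector extends DefaultHandler {
        final List<Map<String, String>> docs = new ArrayList<>();
        String rootName;                      // name of the document's root element
        private Map<String, String> current;  // fields of the <doc> being read, else null
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (rootName == null) rootName = qName;
            if ("doc".equals(qName)) current = new LinkedHashMap<>();
            text.setLength(0);  // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("doc".equals(qName)) {
                docs.add(current);
                current = null;
            } else if (current != null) {
                current.put(qName, text.toString());
            }
        }
    }

    static String sortByEvent(String xml) throws Exception {
        // Read: collect all records in memory.
        DocCollector c = new DocCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), c);

        // Sort: List.sort is stable, so records with equal events keep their order.
        c.docs.sort(Comparator.comparingLong(
                (Map<String, String> d) -> Long.parseLong(d.get("event").trim())));

        // Write: XMLStreamWriter escapes characters such as & and < in text.
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement(c.rootName);
        for (Map<String, String> d : c.docs) {
            w.writeStartElement("doc");
            for (Map.Entry<String, String> f : d.entrySet()) {
                w.writeStartElement(f.getKey());
                w.writeCharacters(f.getValue());
                w.writeEndElement();
            }
            w.writeEndElement();
        }
        w.writeEndElement();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs>"
                + "<doc><id>1</id><event>30</event></doc>"
                + "<doc><id>2</id><event>10</event></doc>"
                + "</docs>";
        System.out.println(sortByEvent(xml));
    }
}
```

The sketch works on strings to stay self-contained; for the 100MB file the parser would read from a `File` and the writer would wrap a buffered file stream.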

I think a problem like this would be better solved using serialisation.

  1. Deserialise the XML file into an ArrayList of 'doc' objects.

  2. Using plain Java code, sort the list on the event element and store the sorted ArrayList in another variable.

  3. Serialise the sorted 'doc' ArrayList out to a file.
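The three steps above can be sketched with the JDK's DOM API standing in for a dedicated serialisation library. This is only one possible reading of the answer: the `<doc>` elements themselves play the role of the 'doc' objects, and the class name `DomEventSort` and helper `eventOf` are hypothetical. A DOM tree for a 100MB file will likely need a heap several times the file size.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomEventSort {
    // Hypothetical helper: numeric value of the first <event> inside a <doc>.
    static long eventOf(Element doc) {
        return Long.parseLong(
                doc.getElementsByTagName("event").item(0).getTextContent().trim());
    }

    static String sortByEvent(String xml) throws Exception {
        // Step 1: parse the XML and collect every <doc> child of the root into a list.
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = dom.getDocumentElement();
        List<Element> docs = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof Element && "doc".equals(n.getNodeName())) {
                docs.add((Element) n);
            }
        }

        // Step 2: sort the list by event (stable, so ties keep document order).
        docs.sort(Comparator.comparingLong(DomEventSort::eventOf));

        // Step 3: appendChild moves a node that is already in the tree, so
        // re-appending in sorted order reorders the records, then serialise.
        for (Element d : docs) {
            root.appendChild(d);
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs>"
                + "<doc><id>1</id><event>30</event></doc>"
                + "<doc><id>2</id><event>10</event></doc>"
                + "</docs>";
        System.out.println(sortByEvent(xml));
    }
}
```

Unlike a field-by-field mapping, moving whole elements keeps any nested content or attributes inside each `<doc>` intact.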

If you do it in memory, you should be able to do this in under 10 seconds. You would be pushing it to do this in under 2 seconds, because it will spend about that much time reading from and writing to disk.

This program should use no more than 4-5 times the original file size in memory, about 500 MB in your case.

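import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.commons.io.FileUtils;

public class SortDocs {
    public static void main(String[] args) throws IOException {
        // Splitting on <doc> and </doc> leaves each record body at an odd index.
        String[] records = FileUtils.readFileToString(new File("my-file.xml"), StandardCharsets.UTF_8).split("</?doc>");
        // Sort key is <event>, as the question asks. Events repeat, so each key holds a list.
        Map<Long, List<String>> recordMap = new TreeMap<>();
        for (int i = 1; i < records.length; i += 2) {
            String record = records[i];
            int pos1 = record.indexOf("<event>");
            int pos2 = record.indexOf("</event>", pos1 + 7);
            long num = Long.parseLong(record.substring(pos1 + 7, pos2).trim());
            recordMap.computeIfAbsent(num, k -> new ArrayList<>()).add(record);
        }
        StringBuilder sb = new StringBuilder(records[0]);
        for (List<String> group : recordMap.values()) {
            for (String s : group) {
                sb.append("<doc>").append(s).append("</doc>");
            }
        }
        sb.append(records[records.length - 1]);
        FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString(), StandardCharsets.UTF_8);
    }
}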
