
Faster multiple parsings: SAX or DOM

I have read many posts saying that SAX is faster than DOM. I am not sure if my question is silly, but I think DOM should be faster if we have plenty of memory: once the tree structure is loaded into memory, it should be faster than SAX.

I need some clarification here; please help me understand. I have a use case where I receive a huge file that has to be parsed multiple times every day. Can I say that DOM might be a bit slower than SAX for the first parse, but that all subsequent parsings will be tremendously faster with DOM, since it loads the entire document structure into memory and reuses it? If so, how can we say that SAX is faster than DOM? Please correct me if I am wrong. Also, if tomorrow I change my XSD and need to push the new structure into memory, is there any way to do that without restarting the application?
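To make the reuse/reload idea concrete, here is a rough sketch of what I have in mind, assuming Java and the standard javax.xml DOM API (the class and method names are purely illustrative):

    // Rough sketch: cache the parsed DOM and swap it atomically when the file
    // (or its XSD) changes, so the application never needs a restart.
    import java.io.File;
    import java.util.concurrent.atomic.AtomicReference;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class CachedDom {
        private final AtomicReference<Document> current = new AtomicReference<>();

        // Re-parse and publish the new tree; call this whenever the file or schema changes.
        public void reload(File xmlFile) throws Exception {
            Document fresh = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xmlFile);
            current.set(fresh);
        }

        // Subsequent "parsings" become cheap in-memory reads of the cached tree.
        public Document get() {
            return current.get();
        }
    }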

We use SAX when:

  1. We are damn sure that a single pass over the file will suffice (which, by the way, it does most of the time). Code that does multiple passes or moves a pointer back and forward can usually be refactored to work in a single pass.

  2. When we are receiving the XML file through some streaming channel, for example over the network, and we want a real-time readout, possibly even before the whole file has finished downloading. SAX can work with partially downloaded files; DOM cannot. (See the sketch after this list.)

  3. When we are interested in a particular locality within the XML, not in the complete document. For example, an Atom feed works best with SAX, but to analyze a WSDL you will need a DOM.
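For case 2, a minimal SAX sketch in Java, assuming the standard javax.xml.parsers API (the element name "entry" is just a placeholder):

    // Minimal SAX sketch: single pass, constant memory, works on a stream, so it
    // can start producing results before the whole file has arrived.
    import java.io.InputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class EntryCounter extends DefaultHandler {
        private int entries = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("entry".equals(qName)) {   // "entry" is just an illustrative element name
                entries++;
            }
        }

        public static int count(InputStream xml) throws Exception {
            EntryCounter handler = new EntryCounter();
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
            return handler.entries;
        }
    }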

We use DOM when:

  1. When a single pass will not do and we need to go up and down in the file (see the sketch after this list).

  2. When the XML is on disk and we don't need real-time readouts. We can take our time: load it, read it, analyze it, then come to a conclusion.

  3. When your boss asks you to do it before lunch and you don't care about the quality.
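For the first two DOM cases, a minimal Java sketch using the standard DOM and XPath APIs (the file name and XPath expression are placeholders):

    // Minimal DOM sketch: load the whole tree once, then move up and down it
    // as many times as needed (here via XPath).
    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class DomLookup {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("data.xml"));   // "data.xml" is a placeholder path

            // Repeated, random-access queries over the same in-memory tree.
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//item[@id='42']", doc, XPathConstants.NODESET);
            System.out.println("matches: " + hits.getLength());
        }
    }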

Now, to answer your question, you provided:

  1. You have a huge file: ........ SAX +1
  2. You parse it multiple times: ..... DOM +1

Both get equal votes. Add your existing knowledge base to that (are you familiar with SAX?). How huge is huge? You said both your XML and your memory are huge, but even a 100 MB file is not a big deal; DOM can handle it. You need to parse it multiple times each day. If one operation takes only a couple of minutes, then retaining the data in memory for the next few hours does not seem wise, and in that case you lose the benefit of DOM. But if one operation itself takes, say, an hour, then you are damn right to retain the pre-processed information.

As I noted, you did not provide enough stats. Gather the stats on data size, memory size, time-to-load in DOM, and processing time. Exactly how many times a day do you need it again? What does your machine do in the meantime? Does it sit idle or analyze other such files?

Take these stats (a rough way to gather some of them is sketched below); either post them here or analyze them yourself, and you will reach a conclusion.
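If it helps, here is a rough Java sketch for measuring two of those numbers, time-to-load in DOM and approximate heap growth; the measurements are crude but usually enough to decide:

    // Rough measurement harness: time the DOM load and note the heap growth.
    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class ParseStats {
        public static void main(String[] args) throws Exception {
            Runtime rt = Runtime.getRuntime();
            long heapBefore = rt.totalMemory() - rt.freeMemory();
            long t0 = System.nanoTime();

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File(args[0]));      // pass the path to your XML file

            long loadMillis = (System.nanoTime() - t0) / 1_000_000;
            long heapAfter = rt.totalMemory() - rt.freeMemory();
            System.out.println("time-to-load in DOM : " + loadMillis + " ms");
            System.out.println("approx. extra heap  : " + (heapAfter - heapBefore) / (1024 * 1024) + " MB");
        }
    }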
