Java SAX: parsing complex, large XML files

Question

I am using SAX to parse some large XML files, and I have a question. The XML files have a complex structure, something like the following:

<library>
    <books>
        <book>
            <title></title>
            <img>
                <name></name>
                <url></url>
            </img>
            ...
            ...
        </book>
        ...
        ...
    </books>
    <categories>
        <category id="abcd">
            <locations>
                <location>...</location>
            </locations>
            <url>...</url>
        </category>
        ...
        ...
    </categories>
    <name>...</name>
    <url>...</url>
</library>

These files are over 50 MB each, and many tags are repeated in different contexts: for example, url appears under /books/book/img, but also under /library, under /library/categories/category, and so on.

My SAX parser uses a subclass of DefaultHandler in which I override the startElement and endElement methods (among others). The problem is that these methods are huge in terms of lines of code, because of the business logic tied to these XML files. I am using a lot of code like this:

if ("url".equalsIgnoreCase(qName)) {
    // peek at stack and if book is on top
    // ...
    // else if category is on top
    // ...
} else if (....) {
}

I was wondering whether there is a more proper, correct, or elegant way to perform the XML parsing.

Thank you all.

Answer 1

What you can do is implement a separate ContentHandler for each context. For example, write one for <books>, one for <categories>, and one top-level handler.

Then, as soon as the top-level handler's startElement method is called for books, you immediately switch the ContentHandler using XMLReader.setContentHandler(). The <books>-specific ContentHandler then switches back to the top-level handler when its endElement method is called for books.

This way each ContentHandler can focus on its particular part of the XML and need not know about all the other parts.

The only ugly-ish part is that the specific handlers need to know about the top-level handler and when to switch back to it. That can be worked around by providing a simple "handler stack" that handles it for you.
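The switching described in this answer might look roughly like the following. This is a minimal sketch, not the answerer's own code: the class names LibraryHandler, BooksHandler, and HandlerSwitchDemo are made up for illustration, the sample XML is invented, and it assumes a non-namespace-aware parser so that qName carries the plain tag name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical driver: wires up the top-level handler and runs the parse. */
class HandlerSwitchDemo {
    static List<String> parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LibraryHandler top = new LibraryHandler(reader);
        reader.setContentHandler(top);
        reader.parse(new InputSource(new StringReader(xml)));
        return top.bookTitles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><books><book><title>Dune</title></book></books>"
                + "<name>City Library</name></library>";
        System.out.println(parse(xml)); // prints [Dune]
    }
}

/** Top-level handler: only knows where a sub-handler takes over. */
class LibraryHandler extends DefaultHandler {
    private final XMLReader reader;
    final List<String> bookTitles = new ArrayList<>();

    LibraryHandler(XMLReader reader) {
        this.reader = reader;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("books".equals(qName)) {
            // Hand the whole <books> subtree over to a dedicated handler.
            reader.setContentHandler(new BooksHandler(reader, this, bookTitles));
        }
    }
}

/** Handles everything inside <books>, then switches back to its parent. */
class BooksHandler extends DefaultHandler {
    private final XMLReader reader;
    private final ContentHandler parent;
    private final List<String> titles;
    private final StringBuilder text = new StringBuilder();

    BooksHandler(XMLReader reader, ContentHandler parent, List<String> titles) {
        this.reader = reader;
        this.parent = parent;
        this.titles = titles;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString().trim());
        } else if ("books".equals(qName)) {
            // The <books> subtree is finished: give control back to the parent.
            reader.setContentHandler(parent);
        }
    }
}
```

Note that in this sketch the top-level handler never sees the endElement event for books, because the switch back happens inside that very callback.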

Answer 2

I'm not sure whether you're asking (1) whether there is something else you can do besides checking the tag against a bunch of strings, or (2) whether there's an alternative to a long if-then-else kind of statement.

The answer to 1 is: not that I've found. Someone else may tackle that one.

The answer to 2 depends on your domain. One approach: if the point of this is to hydrate a bunch of objects from an XML file, then you can use a factory method.

So the first factory method has the long if-then-else statement, which simply passes the XML off to the appropriate classes. Then each of your classes has a method like constructYourselfFromXmlString. This will improve your design, because only the objects themselves know about the private data in the XML that hydrates them.

The reason this is hard is that, if you think about it, exporting an object to XML and importing it back in really violates encapsulation. There is nothing to be done about it; it just is. This approach at least makes things a little more encapsulated.

HTH

Answer 3

While I agree with the sentiment that exporting an object to XML is a violation of encapsulation, the actual technique for handling tags that are nested at different depths isn't terribly difficult with SAX.

Basically, keep a StringBuffer that maintains your "location" in the document: a directory-like representation of the nested tag you are currently within. For example, if at the moment the string buffer's contents are /library/book/img/url, then you know it's a URL for an image in a book, and not a URL for some category.

Once you have ensured that your "path tracking" algorithm is correct, you can wrap your object creation routines with better handling by using string matches. Instead of

if ("url".equalsIgnoreCase(qName)) {
   ...
}

you can now substitute

if (location.equalsIgnoreCase("/library/book/img/url")) {
   ...
}
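One way the path tracking could be wired into a handler is sketched below. It is an illustration under assumptions rather than the answerer's implementation: the class name PathTrackingHandler and the sample XML are invented, StringBuilder stands in for StringBuffer since no synchronization is needed here, and the paths follow the question's sample document (with a books level), not the shorter path used in the snippet above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that tracks its position as a slash-separated path. */
class PathTrackingHandler extends DefaultHandler {
    private final StringBuilder location = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    final List<String> imageUrls = new ArrayList<>();
    final List<String> categoryUrls = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        location.append('/').append(qName); // descend one level
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String path = location.toString();
        // The same tag name is told apart by the full path it appears under.
        if (path.equals("/library/books/book/img/url")) {
            imageUrls.add(text.toString().trim());
        } else if (path.equals("/library/categories/category/url")) {
            categoryUrls.add(text.toString().trim());
        }
        // Climb back up: drop the trailing "/" + qName segment.
        location.setLength(location.length() - qName.length() - 1);
        text.setLength(0);
    }

    static PathTrackingHandler parse(String xml) throws Exception {
        PathTrackingHandler handler = new PathTrackingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        PathTrackingHandler h = parse(
                "<library><books><book><img><url>cover.png</url></img></book></books>"
                + "<categories><category><url>/fiction</url></category></categories>"
                + "<url>http://example.org</url></library>");
        System.out.println(h.imageUrls + " " + h.categoryUrls);
    }
}
```

Because the path is only trimmed in endElement, its correctness depends on every startElement being paired with an endElement, which SAX guarantees for well-formed input.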

If for some reason this doesn't appeal to you, there are still other solutions. For example, you can make a SAX handler that implements a stack of handlers, where the top handler is responsible for handling just its portion of the XML document and pops itself off the stack once it is done. Using such a scheme, each object gets created by its own individual handler, and some handlers basically check and direct which "object creation" handlers get pushed onto the handling stack at the appropriate times.
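A stack of handlers along these lines could be sketched as follows. All names here (HandlerStack, RootHandler, BookHandler, HandlerStackDemo) are hypothetical, and the sketch assumes that book elements are never nested inside one another; it is one possible shape for the idea, not a definitive design.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical driver: registers the dispatcher with the parser. */
class HandlerStackDemo {
    static List<String> parse(String xml) throws Exception {
        HandlerStack stack = new HandlerStack();
        List<String> titles = new ArrayList<>();
        stack.push(new RootHandler(stack, titles));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), stack);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<library><books><book><title>Dune</title></book></books></library>"));
    }
}

/** The only handler the parser sees; forwards each event to the top of the stack. */
class HandlerStack extends DefaultHandler {
    private final Deque<DefaultHandler> stack = new ArrayDeque<>();

    void push(DefaultHandler handler) { stack.push(handler); }

    void pop() { stack.pop(); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        stack.peek().startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        stack.peek().characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        stack.peek().endElement(uri, localName, qName);
    }
}

/** Directs traffic: decides which object-creating handler to push. */
class RootHandler extends DefaultHandler {
    private final HandlerStack stack;
    private final List<String> titles;

    RootHandler(HandlerStack stack, List<String> titles) {
        this.stack = stack;
        this.titles = titles;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("book".equals(qName)) {
            stack.push(new BookHandler(stack, titles));
        }
    }
}

/** Builds one book, then pops itself off the stack when </book> arrives. */
class BookHandler extends DefaultHandler {
    private final HandlerStack stack;
    private final List<String> titles;
    private final StringBuilder text = new StringBuilder();
    private String title = "";

    BookHandler(HandlerStack stack, List<String> titles) {
        this.stack = stack;
        this.titles = titles;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            title = text.toString().trim();
        } else if ("book".equals(qName)) {
            titles.add(title);
            stack.pop(); // this handler's portion of the document is done
        }
    }
}
```

Compared with switching the ContentHandler on the XMLReader, the specific handlers here only need a reference to the stack, not to whichever handler happened to push them.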

I've used both techniques. There are strengths in both, and which one is best really depends on the input and the objects you need.

Answer 4

You could refactor your SAX content handling so that you register a set of rules, each of which has a test that it applies to see whether it matches the element, and an action that is executed if it does. This moves closer to the XSLT processing model while still doing streamed processing. Or you could move to XSLT: processing 50 MB input files is well within the capabilities of a modern XSLT processor.
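A rule-based handler of this kind might be sketched as below. The names RuleBasedHandler, addRule, and RuleDemo are invented for illustration, and the choice to test rules against a slash-separated element path at endElement time is an assumption of this sketch; the answer itself does not say what the tests look like.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical driver: registers two rules and collects what they emit. */
class RuleDemo {
    static List<String> parse(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        RuleBasedHandler handler = new RuleBasedHandler();
        handler.addRule(path -> path.endsWith("/img/url"), value -> out.add("image:" + value));
        handler.addRule(path -> path.equals("/library/url"), value -> out.add("library:" + value));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<library><books><book><img><url>cover.png</url></img>"
                + "</book></books><url>http://example.org</url></library>"));
    }
}

/** Generic handler: the business logic lives in the registered rules. */
class RuleBasedHandler extends DefaultHandler {
    private static final class Rule {
        final Predicate<String> test;
        final Consumer<String> action;

        Rule(Predicate<String> test, Consumer<String> action) {
            this.test = test;
            this.action = action;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    /** Registers a test on the element path and an action on the element's text. */
    void addRule(Predicate<String> pathTest, Consumer<String> action) {
        rules.add(new Rule(pathTest, action));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String current = "/" + String.join("/", path);
        String value = text.toString().trim();
        for (Rule rule : rules) {
            if (rule.test.test(current)) {
                rule.action.accept(value); // every matching rule fires
            }
        }
        path.removeLast();
        text.setLength(0);
    }
}
```

With this shape, startElement and endElement stay a few lines long no matter how many element types the document has; adding a case means registering another rule.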

Answer 5

An attempt to make SAX-to-Java binding easier

5 answers

Answer 1
1 2011-10-24 14:57:54

Answer 2
0 accepted 2011-10-24 13:48:03

Answer 3
0 2011-10-24 14:30:05

Answer 4
0 2011-10-24 17:54:41

Answer 5
0 2011-11-03 17:53:26
