
Efficient way to read a small part of a BIG XML file in Java

Original post: 2012-08-24 19:13:30. Tags: java, xml-parsing, sax

We have a new requirement:

BIG XML files keep arriving in our system, and we need to process them immediately and quickly using Java. Each file is huge, but the information we need for processing sits inside one very small element. ... ... ......

What is the best way to extract this small portion of the data from the huge file before we start processing? If we try to load the entire file, we immediately get an out-of-memory error because of its size. What is an efficient way in Java to get the ..data..data..data.. data element without loading the whole file or reading it line by line? Is there a SAX parser I can use to get this done?

Thank you

4 answers

SAX parsers are event based and much faster because they do what you need: they stream through the document and fire callbacks as they go, so the whole XML document is never loaded into memory, and you can stop parsing as soon as you have what you want. A SAXParser is available in the Java distribution.
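As a rough sketch of that approach, the handler below collects the text of the first matching element and then aborts the parse by throwing an exception, so the rest of the file is never scanned. The class names, the element name "header", and the sample XML are all made up for illustration; only the javax.xml.parsers and org.xml.sax calls come from the JDK. It assumes the target element is not nested inside another element of the same name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SmallElementExtractor {

    /** Hypothetical marker exception, thrown to abort parsing once the target is read. */
    static final class FoundException extends SAXException {
        FoundException() { super("target element found"); }
    }

    /** Collects the text of the first element named targetName, then stops the parse. */
    static final class TargetHandler extends DefaultHandler {
        private final String targetName;
        private final StringBuilder text = new StringBuilder();
        private boolean inside = false;

        TargetHandler(String targetName) { this.targetName = targetName; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default factory is not namespace-aware, so qName holds the element name.
            if (qName.equals(targetName)) inside = true;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            if (inside) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inside && qName.equals(targetName)) throw new FoundException();
        }

        String text() { return text.toString(); }
    }

    /** Returns the text of the first matching element, or null if there is none. */
    static String extract(InputStream in, String targetName) throws Exception {
        TargetHandler handler = new TargetHandler(targetName);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (FoundException stop) {
            return handler.text(); // parsing was cut short on purpose
        }
        return null; // reached end of document without finding the element
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a huge file; in practice pass a FileInputStream instead.
        String xml = "<root><header>small payload</header><body><row>1</row><row>2</row></body></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(in, "header")); // prints "small payload"
    }
}
```

Memory use stays roughly constant regardless of file size, because only the text of the one small element is buffered. If the element sits near the end of the file, the parser still has to scan everything before it.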

I had to parse huge files (1-2 GB) in a previous project and didn't want to deal with SAX. I find SAX too low-level in some instances and prefer to keep a traversal approach in most cases.

I used the VTD library, http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.

Well, if you want to read part of a file, you will need to read each line of the file to identify the part of interest, and then extract what you need.

If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.

Java comes with a built-in SAXParser implementation as well as an XPath implementation. See the javadocs for SAXParser and for XPath for details.
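For comparison, here is a minimal sketch of the XPath route using the JDK's javax.xml.xpath API. The class name, the sample XML, and the expressions are invented for illustration. One caveat worth checking against your file sizes: to my knowledge the built-in implementation parses the whole input into an in-memory tree before evaluating the expression, so it is simpler to write than SAX but may hit the same out-of-memory problem on very large files.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathLookup {

    /** Evaluates an XPath expression against the XML text and returns the result as a string. */
    static String evaluate(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // For a file, an InputSource built from a FileInputStream would be used instead.
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document shape: one small <header> next to a large <body>.
        String xml = "<root><header id=\"42\">small payload</header><body><row>1</row></body></root>";
        System.out.println(evaluate(xml, "/root/header"));     // prints "small payload"
        System.out.println(evaluate(xml, "/root/header/@id")); // prints "42"
    }
}
```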