Extracting an XML node (the complete XML, not just its text) together with other text nodes from an XML file using a SAX parser in Java
I have to read large XML files, each around 500 MB. The batch typically processes 500 such files per run. I have to extract text nodes from them and, at the same time, extract XML nodes (complete elements with their markup).
I used XPath over DOM in Java for ease of use, but that doesn't work because of memory issues, as I have limited resources.
I intend to use SAX or StAX in Java now. The text nodes can be extracted easily, but I don't know how to extract XML nodes from the XML using SAX.
A sample:
<?xml version="1.0"?>
<Library>
<Book name = "ABC">
<Author>John</Author>
<PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>
<AssocPrint>Oreilly</AssocPrint> </PrintingCompanyDT>
</Book>
<Book name = "123">
<Author>Mason</Author>
<PrintingCompanyDT><Printer>kelly</Printer><Printmachine>DOTPrint</Printmachine>
<AssocPrint>Oxford</AssocPrint> </PrintingCompanyDT>
</Book>
</Library>
The expected result:
1) Book: ABC
Author: John
PrintCompany Detail XML:
<PrintingCompanyDT>
<Printer>Sam</Printer>
<Printmachine>Laser</Printmachine>
<AssocPrint>Oreilly</AssocPrint>
</PrintingCompanyDT>
2) Book: 123
Author: Mason
PrintCompany Detail XML:
<PrintingCompanyDT>
<Printer>kelly</Printer>
<Printmachine>DOTPrint</Printmachine>
<AssocPrint>Oxford</AssocPrint>
</PrintingCompanyDT>
If I try the regular way of appending characters in the public void characters(char ch[], int start, int length) method, I get the following:
1) Book: ABC
Author: John
PrintCompany Detail XML:
Sam
Laser
Oreilly
That is, exactly the text content and whitespace, without the tags.
Can somebody suggest how to extract an XML node as-is from an XML file with a SAX or StAX parser in Java?
I'd be tempted to use XOM for this sort of task rather than SAX or StAX directly. XOM is a tree-based representation similar to DOM or JDOM, but it has support for processing XML "twigs" in a kind of semi-streaming fashion, ideal for your kind of case where you have many similar elements that can be processed independently of one another. Also, every Node has a toXML method that prints the node as XML.
import java.io.File;
import nu.xom.*;

public class LibraryProcessor extends NodeFactory {
    // Returned in place of each Book so that it is not kept in the tree
    private Nodes empty = new Nodes();
    private int bookNum = 0;

    /** Called for each closing tag in the XML */
    @Override
    public Nodes finishMakingElement(Element element) {
        if ("Book".equals(element.getLocalName())) {
            bookNum++;
            // process the complete Book element ...
            processBook(element);
            // ... and throw it away
            return empty;
        } else {
            // process other elements (except Book) in the normal way
            return super.finishMakingElement(element);
        }
    }

    private void processBook(Element book) {
        System.out.println(bookNum + ": " +
            book.getAttributeValue("name"));
        System.out.println("Author: " +
            book.getFirstChildElement("Author").getValue());
        System.out.println("PrintCompany Detail XML: " +
            book.getFirstChildElement("PrintingCompanyDT").toXML());
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the XML file to process
        Builder builder = new Builder(new LibraryProcessor());
        builder.build(new File(args[0]));
    }
}
This will work its way through the XML document, calling processBook once for each Book element in turn. Within processBook you have access to the whole Book XML tree as XOM nodes, but without having to load the entire file into memory in one go: the best of both worlds. The "Factories, Filters, Subclassing, and Streaming" section of the XOM tutorial has more detail on this technique.
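If you would rather stay with plain SAX, as the question asks, the usual trick is to rebuild the markup yourself from the SAX events instead of collecting only characters(). The following is a minimal sketch using only the JDK, not a complete serializer: the class name SubtreeCapture and its extract helper are made up for illustration, and it assumes the default non-namespace-aware parser and ignores comments, processing instructions and CDATA boundaries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: re-serializes every element named `target` from SAX events. */
class SubtreeCapture extends DefaultHandler {
    private final String target;
    private final Consumer<String> sink; // receives each captured fragment
    private StringBuilder buf;           // non-null only while inside a target element
    private int depth;                   // nesting depth inside the captured element

    SubtreeCapture(String target, Consumer<String> sink) {
        this.target = target;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (buf == null) {
            if (!qName.equals(target)) return; // outside any target element: ignore
            buf = new StringBuilder();
            depth = 0;
        }
        depth++;
        buf.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            buf.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        buf.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is kept only while capturing, and re-escaped so the fragment stays well-formed
        if (buf != null) buf.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buf == null) return;
        buf.append("</").append(qName).append('>');
        if (--depth == 0) {          // closing tag of the target element itself
            sink.accept(buf.toString());
            buf = null;              // drop the buffer so memory use stays bounded
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Convenience wrapper for small inputs; collects all fragments in a list. */
    static List<String> extract(String xml, String target) throws Exception {
        List<String> out = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), new SubtreeCapture(target, out::add));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Library><Book name=\"ABC\"><Author>John</Author>"
            + "<PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>"
            + "<AssocPrint>Oreilly</AssocPrint></PrintingCompanyDT></Book></Library>";
        for (String fragment : extract(xml, "PrintingCompanyDT")) {
            System.out.println(fragment);
        }
    }
}
```

For the real 500 MB files you would pass a File or InputStream to SAXParser.parse and handle each fragment in the sink as it arrives rather than collecting them all in a list. Note that the fragment is rebuilt from events, so details such as attribute quoting and empty-element syntax may differ from the original text.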
The LibraryProcessor example above shows only the most basic bits of the XOM API, but XOM also provides powerful XPath support if you need to do more complex processing. For example, you can directly access the Printmachine element within processBook using
Element machine = (Element) book.query("PrintingCompanyDT/Printmachine").get(0);
or, if the structure is not so regular, for example if PrintingCompanyDT is sometimes a direct child of Book and sometimes deeper (e.g. a grandchild), then you can use a query like
Element printingCompanyDT = (Element)book.query(".//PrintingCompanyDT").get(0);
(// being the XPath notation for finding descendants at any level, as opposed to /, which looks only for direct children).
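To see the difference between / and // in isolation, here is a small sketch that uses the JDK's own javax.xml.xpath API rather than XOM, so it runs without extra libraries. The class name XPathAxesDemo and the tiny inline document are invented for illustration. The expressions are absolute because the JDK evaluates them from the document root, whereas book.query in XOM evaluates them relative to the Book element. This builds a DOM internally, so it is only meant for small fragments, not for the 500 MB files.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demo class: counts the nodes matched by an XPath expression. */
class XPathAxesDemo {
    static int count(String xml, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
            expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // One PrintingCompanyDT is a direct child of Book, the other is a grandchild
        String xml = "<Book><PrintingCompanyDT/>"
                   + "<Extra><PrintingCompanyDT/></Extra></Book>";
        // "/" steps only to direct children
        System.out.println(count(xml, "/Book/PrintingCompanyDT"));
        // "//" matches descendants at any depth
        System.out.println(count(xml, "/Book//PrintingCompanyDT"));
    }
}
```

With this document the first expression matches only the direct child, while the second matches both elements.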