
Extracting an XML node (the complete XML, not just its text) along with other text nodes from an XML file using a SAX parser in Java

I have to read large XML files of roughly 500 MB each. The batch typically processes 500 such files per run. I have to extract text nodes from them and, at the same time, extract complete XML nodes. I used XPath over DOM in Java for ease of use, but that fails with memory problems because my resources are limited.

I intend to use SAX or StAX in Java now. The text nodes can be extracted easily, but I don't know how to extract XML nodes from XML using SAX.

A sample:

<?xml version="1.0"?>
<Library>
  <Book name = "ABC">
    <Author>John</Author>
    <PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>    
    <AssocPrint>Oreilly</AssocPrint> </PrintingCompanyDT>
  </Book>
  <Book name = "123">
    <Author>Mason</Author>
    <PrintingCompanyDT><Printer>kelly</Printer><Printmachine>DOTPrint</Printmachine>
    <AssocPrint>Oxford</AssocPrint> </PrintingCompanyDT>
  </Book>
</Library>

The expected result:

1) Book: ABC
Author: John
PrintCompany Detail XML:

<PrintingCompanyDT>
  <Printer>Sam</Printer>
  <Printmachine>Laser</Printmachine>
  <AssocPrint>Oreilly</AssocPrint> 
</PrintingCompanyDT>


2) Book: 123
Author: Mason
PrintCompany Detail XML:

<PrintingCompanyDT>
  <Printer>kelly</Printer>
  <Printmachine>DOTPrint</Printmachine>
  <AssocPrint>Oxford</AssocPrint>
</PrintingCompanyDT>


If I try the regular way of appending characters in the public void characters(char ch[], int start, int length) method, I get the following:

1) Book: ABC
Author: John
PrintCompany Detail XML:

Sam 
  Laser
      Oreilly

That is, only the text content and whitespace, with the tags gone.

Can somebody suggest how to extract an XML node as-is from an XML file through a SAX or StAX parser in Java?

I'd be tempted to use XOM for this sort of task rather than SAX or StAX directly. XOM is a tree-based representation similar to DOM or JDOM, but it supports processing XML "twigs" in a kind of semi-streaming fashion, which is ideal for a case like yours where many similar elements can be processed independently of one another. Also, every Node has a toXML method that returns the node serialized as XML.

import java.io.File;

import nu.xom.*;

public class LibraryProcessor extends NodeFactory {
  private Nodes empty = new Nodes();
  private int bookNum = 0;

  /** Called for each closing tag in the XML */
  public Nodes finishMakingElement(Element element) {
    if("Book".equals(element.getLocalName())) {
      bookNum++;
      // process the complete Book element ...
      processBook(element);
      // ... and throw it away
      return empty;
    } else {
      // process other elements (except Book) in the normal way
      return super.finishMakingElement(element);
    }
  }

  private void processBook(Element book) {
    System.out.println(bookNum + ": " +
        book.getAttributeValue("name"));
    System.out.println("Author: " +
        book.getFirstChildElement("Author").getValue());
    System.out.println("PrintCompany Detail XML: " +
        book.getFirstChildElement("PrintingCompanyDT").toXML());
  }

  public static void main(String[] args) throws Exception {
    Builder builder = new Builder(new LibraryProcessor());
    builder.build(new File(args[0]));
  }
}

This will work its way through the XML document, calling processBook once for each Book element in turn. Within processBook you have access to the whole Book XML tree as XOM nodes, but without having to load the entire file into memory in one go: the best of both worlds. The "Factories, Filters, Subclassing, and Streaming" section of the XOM tutorial has more detail on this technique.

This example just shows the most basic bits of the XOM API, but XOM also provides powerful XPath support if you need to do more complex processing. For example, you can directly access the Printmachine element within processBook using

Element machine = (Element)book.query("PrintingCompanyDT/Printmachine").get(0);

or, if the structure is not so regular, for example if PrintingCompanyDT is sometimes a direct child of Book and sometimes deeper (e.g. a grandchild), then you can use a query like

Element printingCompanyDT = (Element)book.query(".//PrintingCompanyDT").get(0);

(// being the XPath notation for finding descendants at any level, as opposed to /, which looks only for direct children.)
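The child-versus-descendant distinction is a property of XPath itself, so it can be checked on a small fragment with the JDK's own javax.xml.xpath rather than XOM's query method. The sketch below is only an illustration under assumptions: the class name XPathAxisSketch, the count helper, and the Details wrapper element are hypothetical, and it parses one tiny Book into a DOM, which is fine for a single twig but is not how the 500 MB file should be read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAxisSketch {

    /** Counts the nodes an XPath expression selects, evaluated relative to the root element. */
    public static int count(String bookXml, String expression) throws Exception {
        Element book = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(bookXml)))
                .getDocumentElement();
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, book, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical Book in which PrintingCompanyDT is a grandchild, wrapped in <Details>.
        String nested = "<Book name=\"ABC\"><Details><PrintingCompanyDT>"
                + "<Printmachine>Laser</Printmachine></PrintingCompanyDT></Details></Book>";

        // Child step: only looks at direct children of Book.
        System.out.println(count(nested, "PrintingCompanyDT"));
        // Descendant step: looks at every level below Book.
        System.out.println(count(nested, ".//PrintingCompanyDT"));
        // Element names are case-sensitive, so this spelling does not match Printmachine.
        System.out.println(count(nested, ".//PrintMachine"));
    }
}
```

With the nested fragment above, the three calls print 0, 1 and 0. The last line is the reason a query has to spell Printmachine exactly as the sample file does: a mismatched name selects nothing, and calling get(0) on an empty result would then fail.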

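If adding XOM as a dependency is not an option, the "complete XML node" part of the question can also be approximated with plain SAX by writing the tags back out while the events arrive, instead of collecting only what characters() delivers. This is a minimal sketch rather than a definitive implementation: the class name TwigSaxSketch and its helper names are made up for illustration, it assumes a non-namespace-aware parser (the JDK default), and it rebuilds the markup from events, so the original whitespace inside tags, comments, and CDATA boundaries are not preserved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TwigSaxSketch {

    /** Rebuilds every PrintingCompanyDT subtree as a string from the SAX events it receives. */
    public static class TwigHandler extends DefaultHandler {
        public final List<String> twigs = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private int depth = 0; // 0 means we are outside the element being captured

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth == 0 && !"PrintingCompanyDT".equals(qName)) {
                return; // not capturing, and this is not the element we want
            }
            depth++;
            buf.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                buf.append(' ').append(atts.getQName(i)).append("=\"")
                   .append(escape(atts.getValue(i))).append('"');
            }
            buf.append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                buf.append(escape(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth == 0) {
                return;
            }
            buf.append("</").append(qName).append('>');
            if (--depth == 0) {
                // The captured element just closed: hand it over and reset the buffer.
                twigs.add(buf.toString());
                buf.setLength(0);
            }
        }

        /** Re-escapes characters that the parser has already unescaped. */
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
        }
    }

    /** Parses the given XML text and returns each PrintingCompanyDT subtree as XML text. */
    public static List<String> extract(String xml) throws Exception {
        TwigHandler handler = new TwigHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.twigs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Library><Book name=\"ABC\"><Author>John</Author>"
                + "<PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>"
                + "<AssocPrint>Oreilly</AssocPrint></PrintingCompanyDT></Book></Library>";
        for (String twig : extract(xml)) {
            System.out.println(twig);
        }
    }
}
```

For a real 500 MB file the same handler could be passed to SAXParser.parse together with a File or InputStream. Only one subtree is held in the buffer at a time, so memory use depends on the size of a single PrintingCompanyDT element rather than on the size of the file.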