
Parsing big XML files using SAX parser (skip some lines/tags)

I am currently developing an app that retrieves data from the internet using SAX. I have used it before to parse simple XML files such as the Google Weather API. However, the websites I am interested in now take parsing to the next level: the page is huge and looks messy. I only need to retrieve a few specific lines; the rest is of no use to me.
Is it possible to skip those useless lines/tags, or do I have to go through the document step by step?

I like commons-digester. It allows you to specify rules against particular tags. A rule gets executed only when its tag is encountered.

Digester is built on top of SAX and hence has all the SAX features, plus the specificity required for selectively parsing particular tags. It also maintains a stack: a new element is pushed when the corresponding tag is encountered and popped when the element ends.
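To make the mechanism concrete, here is a minimal sketch of the same idea (a stack of open elements plus rules keyed by path) written against plain JDK SAX. It is not Digester's API: the class `PathRules`, its `addRule` method, and the helper `attributeValues` are hypothetical names invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; Digester's real classes and methods differ.
class PathRules extends DefaultHandler {
    // Rules keyed by the full element path, e.g. "config/server".
    private final Map<String, Consumer<Attributes>> rules = new HashMap<>();
    // Names of the currently open elements, outermost first.
    private final Deque<String> stack = new ArrayDeque<>();

    void addRule(String path, Consumer<Attributes> action) {
        rules.put(path, action);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        stack.addLast(qName);                       // push on element start
        Consumer<Attributes> rule = rules.get(String.join("/", stack));
        if (rule != null) {
            rule.accept(attributes);                // fire only for matching paths
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.removeLast();                         // pop on element end
    }

    // Collects one attribute from every element found at exactly the given path.
    static List<String> attributeValues(String xml, String path, String attribute) {
        List<String> out = new ArrayList<>();
        PathRules handler = new PathRules();
        handler.addRule(path, attrs -> out.add(attrs.getValue(attribute)));
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return out;
    }
}
```

Elements whose path matches no rule cost only a push and a pop, which is roughly what "skipping" means in a SAX-based tool: the parser still reads them, but no application code runs.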

I use it for parsing all my configuration files.

Check out Digester at http://commons.apache.org/digester/

Yes, you can do that: just ignore the tags you are not interested in. Note, however, that the entire document still has to be parsed (DefaultHandler implementation):

@Override
public void startElement(String uri, String localName,
        String qName, Attributes attributes) {
    // localName is only populated when the parser is namespace-aware;
    // otherwise compare against qName instead.
    if (localName.equals("myInterestingTag")) {
        // do your thing....
    }
}

@Override
public void endElement(String uri, String localName, String qName) {
    if (localName.equals("myInterestingTag")) {
        // do your thing....
    }
}

@Override
public void characters(char[] ch, int start, int length) {
    // if parsing myInterestingTag... do some stuff.
}
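A fuller version of this handler might look like the sketch below. It assumes the default (non-namespace-aware) JDK parser, so it matches on `qName`; the class name `InterestingTagCollector` and the helper `collect` are made up for the example, and nested occurrences of the wanted tag are not handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps the text of one tag and ignores everything else.
class InterestingTagCollector extends DefaultHandler {
    private final String wantedTag;
    private final List<String> values = new ArrayList<>();
    private StringBuilder current;   // non-null only while inside the wanted tag

    InterestingTagCollector(String wantedTag) {
        this.wantedTag = wantedTag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals(wantedTag)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(wantedTag) && current != null) {
            values.add(current.toString().trim());
            current = null;
        }
    }

    // Parses the whole document but returns only the text of the wanted tag.
    static List<String> collect(String xml, String wantedTag) {
        try {
            InterestingTagCollector handler = new InterestingTagCollector(wantedTag);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }
}
```

The `current` field doubles as the "am I inside the interesting tag" flag, so `characters` does nothing for the rest of the document.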

Yes, you can skip them. Just handle the tags you want, and only those tag values will be fetched.

You can try XPath, which will use SAX behind the scenes to parse your XML. The downside is that the XML is parsed again on every call to the XPath evaluate method.
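One way around the repeated parsing, sketched here as an assumption rather than as part of the answer above, is to parse the input once into a DOM `Document` and evaluate every expression against that. The class `XPathOnce` and its methods are hypothetical names. The trade-off is that the whole document is held in memory, which may matter for very large files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse once, query many times.
class XPathOnce {
    // Builds an in-memory DOM tree from the XML text (a single parse).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    // Evaluates an XPath expression against the already-parsed tree
    // and returns the string value of the first match.
    static String query(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

With this shape, calling `query` several times on the same `Document` does not touch the original input again.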

If you want to read specific tags, a DOM parser is much faster than a SAX parser. If you want to parse large XML files, a SAX parser is useful.

You can try combining TagSoup, to produce a parsable XML document, with XPath to extract the interesting parts.

See my answer to a similar question for a strategy of using SAX to skip/ignore tags:

Skipping nodes with sax

It involves switching ContentHandlers on the XMLReader. When you reach a portion of the XML document that you want to skip, you simply swap in a ContentHandler that does nothing with the events. When the end of the section to be ignored is reached, it passes control back to the content handler you were using to process the XML content.
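A minimal sketch of that handler swap is shown below. It is not the code from the linked answer: the names `SkippingParser`, `IgnoringHandler`, `MainHandler`, and `elementsOutside` are hypothetical, and the main handler merely records element names so that the effect is visible.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SkippingParser {
    // Swallows every event of one subtree, then restores the previous handler.
    static class IgnoringHandler extends DefaultHandler {
        private final XMLReader reader;
        private final ContentHandler previous;
        private int depth = 1;   // the skipped element itself is already open

        IgnoringHandler(XMLReader reader, ContentHandler previous) {
            this.reader = reader;
            this.previous = previous;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (--depth == 0) {
                reader.setContentHandler(previous);   // end of the ignored section
            }
        }
    }

    // Records element names, delegating unwanted subtrees to IgnoringHandler.
    static class MainHandler extends DefaultHandler {
        private final XMLReader reader;
        private final String skipTag;
        final List<String> seen = new ArrayList<>();

        MainHandler(XMLReader reader, String skipTag) {
            this.reader = reader;
            this.skipTag = skipTag;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals(skipTag)) {
                reader.setContentHandler(new IgnoringHandler(reader, this));
            } else {
                seen.add(qName);
            }
        }
    }

    // Returns the names of all elements that are not inside a skipTag subtree.
    static List<String> elementsOutside(String xml, String skipTag) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            MainHandler main = new MainHandler(reader, skipTag);
            reader.setContentHandler(main);
            reader.parse(new InputSource(new StringReader(xml)));
            return main.seen;
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }
}
```

The depth counter is what lets the ignoring handler cope with nested elements, including nested occurrences of the skipped tag itself. The parser still reads the skipped section; only the application-side work is avoided.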


 