
Java equivalent to PHP Simple HTML DOM Parser

Since I need multithreading, which I cannot solve elegantly in PHP, I would like to program in Java. Unfortunately, I could not find a library that lets me parse an HTML DOM as robustly, quickly and easily as PHP Simple HTML DOM Parser. Do you know of alternatives in Java that are as easy to use?

I went from Simple HTML DOM Parser to JSoup and I'm quite happy with it.

I can see that we have two challenges here:

  • Parsing HTML that might not be well-formed XHTML, which would be easy and nice to parse. I'd recommend the TagSoup library, which can read ugly HTML and produce a well-formed SAX event stream that can then be used elsewhere.

  • Building a DOM representation of the HTML document and working with it. As you probably know, the JDK ships a full-blown implementation of XML DOM (org.w3c.dom.*). But I guess this is not the type of API you've been looking for. What about DOM4J or the older JDOM, which can wrap a JDK Document and give you an easy-to-use API?
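
To give a feel for what the plain JDK API from the second point looks like, here is a minimal sketch that parses a string into an org.w3c.dom Document and collects link targets with XPath. It assumes the input is already well-formed XHTML; real-world HTML would first need something like TagSoup in front of it. The class name JdkDomExample and the method extractLinks are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JdkDomExample {

    // Hypothetical helper: returns the href of every <a> element that has one.
    // Assumes the input is well-formed XHTML without a namespace declaration.
    static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Select all anchor elements carrying an href attribute.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList anchors =
                (NodeList) xpath.evaluate("//a[@href]", doc, XPathConstants.NODESET);

        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"/a\">A</a>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(extractLinks(page));
    }
}
```

This works, but the factory and cast boilerplate is probably why wrappers such as DOM4J or JDOM feel nicer for everyday scraping.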

I've successfully used TagSoup as a SAX parser to populate DOM4J Documents, which I then query with XPath. It took me a while to work out the incantations (Scala, but I'm sure that you can convert):

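import java.io.StringReader
import org.dom4j.io.SAXReader, org.xml.sax.InputSource
// page is the HTML source as a String
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val reader = new SAXReader(parserFactory.newSAXParser.getXMLReader)
val doc = reader.read(new InputSource(new StringReader(page)))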

