Using JPedal to extract hyperlinks from HTML? (Java)

Original post: 2011-10-05 19:21:26. Tags: java, html, parsing, dom, jpedal

The JPedal library for Java is usually used to convert PDF to XML or HTML. However, I need to know whether the JPedal API can also extract data from an HTML5 document and save it as XML. If not, is there an alternative?

I am also trying to parse an HTML5 document in Java and store the result as XML. Are there any good solutions for finding only specific tags and generating XML from them?

Please let me know. Thank you.

1 solution

There are a number of Java HTML parsers out there, but I recommend the HTML5 parser from validator.nu, available for download here: http://about.validator.nu/htmlparser/

It was written by Henri Sivonen of Mozilla, one of the main protagonists of HTML5, and it implements the HTML5 parsing algorithm. You won't find a more reliable HTML parser, and it creates a true DOM that can be manipulated using standard XML tools and queried for hyperlinks using XPath. There are examples of how to use XSLT transformations with it and how to get an XML serialization of the created DOM.
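To make the DOM, XPath, and serialization steps concrete, here is a minimal sketch that uses only JDK classes. The class name LinkExtractor, its method names, and the links/link output format are made up for illustration. The parse step uses the JDK's own DocumentBuilder as a stand-in, so it only accepts well-formed markup; with the validator.nu library on the classpath, its nu.validator.htmlparser.dom.HtmlDocumentBuilder should be usable in its place for real-world HTML5, but check that against the library's documentation. The XPath matches on local-name() so that it works whether or not the parser puts elements in the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of JPedal or the validator.nu parser.
public class LinkExtractor {

    // Stand-in parse step: the JDK builder requires well-formed markup.
    // Assumption: an HTML5-aware DocumentBuilder could be swapped in here.
    static Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Collects the href value of every <a> element, namespace-agnostic.
    static List<String> extractLinks(Document doc) {
        try {
            NodeList hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='a']/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Builds a small <links> document and serializes it to a string.
    static String toXml(List<String> links) {
        try {
            Document out = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = out.createElement("links");
            out.appendChild(root);
            for (String href : links) {
                Element link = out.createElement("link");
                link.setAttribute("href", href);
                root.appendChild(link);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(out), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.com/\">Home</a>"
                + "<p><a href=\"/about\">About</a></p></body></html>";
        System.out.println(toXml(extractLinks(parse(page))));
    }
}
```

The same pattern covers the "specific tags" part of the question: change the XPath expression to select whichever elements or attributes are needed, and adjust toXml to emit the structure you want.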