
HTML Parser for response - Java

I'm using HttpClient to access a particular website, and the response I get is HTML. Which parser or method should I use to parse the HTML and extract what I need from the response? Note: I'm using HttpClient with Java.

Use jsoup.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe white-list, to prevent XSS attacks
  • output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
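To make the "find and extract data" point concrete, here is a minimal sketch of parsing an HTML string and pulling out text with a CSS selector. It assumes jsoup is on the classpath; the class name JsoupSelectSketch, the helper selectTexts, and the sample markup are made up for illustration and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class JsoupSelectSketch {

    /** Parses the HTML and returns the text of every element matching the CSS query. */
    static List<String> selectTexts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);   // tolerant of unclosed or invalid tags
        List<String> texts = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            texts.add(element.text());      // text content without markup
        }
        return texts;
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: the <p> is never closed.
        String html = "<p>Unclosed paragraph <a href='/myapp/start'>HelloWorldApp</a>"
                + "<a href='/other'>Other</a>";
        // a[href*=myapp] matches links whose href contains "myapp".
        System.out.println(selectTexts(html, "a[href*=myapp]"));
    }
}
```

In your case the html argument would be the response body you got back from HttpClient.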

I would give htmlcleaner a try.

HtmlCleaner is a Java library used to safely parse and transform any HTML found on the web to well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner may be used in Java code, as a command line tool or as an Ant task. The result of parsing is a lightweight document object model which can easily be transformed to standards like DOM or JDom, or serialized to XML output in various ways (compact, pretty printed and so on).

You can use XPath with htmlcleaner to get the contents within XML/HTML tags. Here is a nice example: Xpath Example
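As a rough illustration of the XPath idea, the sketch below runs an XPath expression over markup that is already well-formed XML, which is the kind of output HtmlCleaner is described as producing. It deliberately uses only the JDK's own javax.xml APIs rather than HtmlCleaner's API, so treat it as a sketch of the querying step under that assumption; the class name XPathOverCleanedHtml, the helper evaluate, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class XPathOverCleanedHtml {

    /** Returns the text content of every node matched by the XPath expression. */
    static List<String> evaluate(String wellFormedXml, String xpathExpression) {
        try {
            // Assumption: the input is already well-formed XML (e.g. cleaned HTML).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    xpathExpression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or an invalid XPath expression ends up here.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String cleaned = "<html><body><div class=\"title\">Hello</div>"
                + "<div>Other</div></body></html>";
        System.out.println(evaluate(cleaned, "//div[@class='title']"));
    }
}
```

Raw HTML from the web will usually fail the XML parse step here, which is exactly why a cleaning pass comes first.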

Sample code with jsoup and Java 8:

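// Imports:
...
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
...

// Execute the GET request:
...
HttpClient clientGet = HttpClientBuilder.create().build();
HttpGet get = new HttpGet(url);
HttpResponse res = clientGet.execute(get);

// Use jsoup to parse the HTML response:
// E.g. find all links whose href matches the regex "myapp":
//  <a href="myapp">HelloWorldApp</a>
Document doc = Jsoup.parse(IOUtils.toString(res.getEntity().getContent(), StandardCharsets.UTF_8));
Elements links = doc.select("a[href~=myapp]");
for (Element link : links) {
    // Inner HTML of the link, e.g. "HelloWorldApp" for the sample above
    String appName = link.html();
}
...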
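The sample above reads the whole body into a String via commons-io before parsing. As an alternative sketch, jsoup can also parse an InputStream directly with Jsoup.parse(InputStream, charsetName, baseUri), and passing a base URI lets absUrl resolve relative links. The class name ResponseStreamSketch, the helper absoluteLinks, and the example.com URL are made up for illustration; the in-memory stream stands in for res.getEntity().getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class ResponseStreamSketch {

    /** Parses the stream as UTF-8 HTML and returns every link as an absolute URL. */
    static List<String> absoluteLinks(InputStream body, String baseUri) {
        try {
            Document doc = Jsoup.parse(body, "UTF-8", baseUri);
            List<String> urls = new ArrayList<>();
            for (Element link : doc.select("a[href]")) {
                urls.add(link.absUrl("href"));  // resolved against baseUri
            }
            return urls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the HTTP response body.
        InputStream body = new ByteArrayInputStream(
                "<a href='/myapp'>HelloWorldApp</a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(absoluteLinks(body, "http://example.com/"));
    }
}
```

Whether to hard-code UTF-8 depends on the site; the charset declared in the response headers is the safer source if you have it.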


 