JTidy or Jsoup for Java

Question

Recently I have been developing web scrapers in Python with BeautifulSoup. Now I want to know which libraries are preferred in Java. After some searching, I mostly see JTidy and JSoup mentioned. What is the difference between them?

Answer 1

JTidy is more commonly used to tidy HTML, that is, to fix malformed or faulty markup such as unclosed tags, e.g. turning <div><span>text</div> into <div><span>text</span></div>.
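As a rough illustration of that tidying step, here is a minimal sketch that assumes JTidy's org.w3c.tidy.Tidy class is on the classpath. The class name JTidySketch and the helper tidy(...) are hypothetical names chosen for this example, and the exact layout of the output (doctype, generated head and title, indentation) depends on the JTidy version and its settings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

public class JTidySketch {
    // Hypothetical helper: runs JTidy over an HTML string and returns the repaired markup.
    static String tidy(String html) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit well-formed XHTML
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // do not print warnings about the repairs

        // The sample input is ASCII-only, so no explicit encoding setup is attempted here.
        ByteArrayInputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);          // parse the input and pretty-print the cleaned document
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The unclosed <span> from the example above should come back closed.
        System.out.println(tidy("<div><span>text</div>"));
    }
}
```

JTidy returns a complete document rather than just the fragment, so expect the repaired <div> to be wrapped in html and body elements.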

JSoup, on the other hand, provides a full-blown API to parse HTML and extract parts of it. It lets you find elements with jQuery-like selectors, or with DOM methods equivalent to the ones you use in JavaScript, such as getElementById. I'd say JSoup is indeed the Java equivalent of BeautifulSoup.
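To make the two lookup styles concrete, here is a small offline sketch that parses an inline string instead of fetching a page. The sample markup and the names JsoupSelectorSketch, introText and paragraphCount are made up for illustration; the jsoup calls used are Jsoup.parse, select, getElementById and getElementsByTag.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorSketch {
    // Hypothetical sample markup, so the example needs no network access.
    static final String HTML =
            "<div id=\"content\"><p class=\"intro\">First</p><p>Second</p></div>";

    // Selector-style lookup, similar to jQuery's $("#content p.intro").
    static String introText(Document doc) {
        Elements intro = doc.select("#content p.intro");
        return intro.text();
    }

    // DOM-style lookup, analogous to document.getElementById in JavaScript.
    static int paragraphCount(Document doc) {
        Element content = doc.getElementById("content");
        return content.getElementsByTag("p").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(introText(doc));
        System.out.println(paragraphCount(doc));
    }
}
```

Both approaches work on the same parsed Document, so they can be mixed freely: selectors are usually shorter for nested lookups, while the DOM methods read naturally when you already know an element's id.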

For example, to extract the first paragraph of a Wikipedia article with JSoup, you could use the following:

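import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://en.wikipedia.org/wiki/Potato";
Document doc = Jsoup.connect(url).get(); // fetches and parses the page; may throw IOException
Elements paragraphs = doc.select(".mw-content-ltr p");
String firstParagraph = paragraphs.first().text();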

Or, to extract the title of this very question:

Document doc = Jsoup.connect("http://stackoverflow.com/questions/12439078/jtidy-or-jsoup-for-java").get();
String question = doc.select("#question-header a").text(); // JTidy or Jsoup for Java

Quite a nice API, eh? :-)

Answer 1 (11, accepted, 2012-09-15 16:32:44)
