
JTidy or Jsoup for Java

Recently I have been developing web scrapers in Python with BeautifulSoup. Now I want to know which libraries are preferred in Java. I have done some searching, and I mostly see JTidy and JSoup mentioned. What is the difference between them?

JTidy is more commonly used to tidy HTML, that is, to fix malformed or faulty HTML such as unclosed tags, e.g., turning <div><span>text</div> into <div><span>text</span></div>.
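To make the tidying step concrete, here is a minimal sketch of how the unclosed <span> above might be repaired with JTidy. It assumes the JTidy jar (org.w3c.tidy) is on the classpath; the class name TidyExample and the helper tidy() are made up for illustration, and the exact whitespace of the output depends on JTidy's formatting settings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.tidy.Tidy;

public class TidyExample {

    // Hypothetical helper: runs a fragment of HTML through JTidy and
    // returns the repaired markup of the body as a string.
    static String tidy(String html) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);           // emit well-formed XHTML
        tidy.setQuiet(true);           // suppress the summary banner
        tidy.setShowWarnings(false);   // do not print warnings to stderr
        tidy.setPrintBodyOnly(true);   // skip the generated <html>/<head> wrapper
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");

        ByteArrayInputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);           // parse, repair, and pretty-print
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The <span> is never closed in the input; JTidy should close it.
        System.out.println(tidy("<div><span>text</div>"));
    }
}
```

Note that this only cleans the markup; pulling specific elements out of the result is where JSoup is the more convenient tool.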

JSoup, on the other hand, provides a full-blown API to parse HTML and to extract parts of it. It allows you to use jQuery-like selectors to find elements, or DOM methods equivalent to the ones you use with JavaScript, such as getElementById. I'd say JSoup is indeed the Java equivalent of BeautifulSoup.
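As a small offline sketch of those two styles, the snippet below parses an inline HTML string (no network access needed) and looks up elements once with a CSS selector and once with DOM-style methods. It assumes the jsoup jar is on the classpath; the class name SelectorExample, the sample markup, and the helper methods are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {

    // Made-up sample markup used only for this illustration.
    static final String HTML =
            "<div id=\"content\"><p class=\"intro\">First</p><p>Second</p></div>";

    // jQuery-like selector: text of the first <p class="intro"> inside #content.
    static String firstIntro(String html) {
        Document doc = Jsoup.parse(html);
        Elements intros = doc.select("div#content p.intro");
        return intros.isEmpty() ? "" : intros.first().text();
    }

    // DOM-style lookup: number of <p> elements under the element with id "content".
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        return content == null ? 0 : content.getElementsByTag("p").size();
    }

    public static void main(String[] args) {
        System.out.println(firstIntro(HTML));     // prints "First"
        System.out.println(paragraphCount(HTML)); // prints 2
    }
}
```

Both helpers guard against a missing match, since select() can return an empty list and getElementById() can return null.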

For example, to extract the first paragraph of a Wikipedia article with JSoup, you could use the following:

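import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://en.wikipedia.org/wiki/Potato";
Document doc = Jsoup.connect(url).get(); // fetches and parses the page; may throw IOException
Elements paragraphs = doc.select(".mw-content-ltr p");
String firstParagraph = paragraphs.first().text();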

Or to extract the title of this very question:

Document doc = Jsoup.connect("http://stackoverflow.com/questions/12439078/jtidy-or-jsoup-for-java").get();
String question = doc.select("#question-header a").text(); // JTidy or Jsoup for Java

Quite a nice API, eh? :-)
