Crawler4j vs. Jsoup for the pages crawling and parsing in Java

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.

Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is the difference between them. There is a similar question, which is marked as answered:

Crawler4j is a crawler, Jsoup is a parser.

But I just checked: Jsoup is also capable of crawling a page in addition to its parsing functionality, while Crawler4j is capable of not only crawling a page but also parsing its content.

So can you please clarify the difference between Crawler4j and Jsoup?

Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, then there is no real benefit from using something like Crawler4J.

Let's take a look at an example. Assume you want to crawl a website. The requirements would be (see the Crawler4j sketch after this list):

  1. Give a base URI (home page).
  2. Take all the URIs from each page and retrieve the contents of those too.
  3. Move recursively through every URI you retrieve.
  4. Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
  5. Avoid circular crawling. Page A has a URI for page B (of the same site). Page B has a URI for page A, but we already retrieved the content of page A (the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again).
  6. The crawling operation must be multithreaded.
  7. The website is vast. It contains a lot of pages. We only want to retrieve 50 URIs, beginning from the Home page.
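As a rough, hedged sketch of how those requirements map onto Crawler4j (API shape as of crawler4j 4.x; the domain, storage folder, page limit, and thread count are illustrative values, not prescriptions):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class SiteCrawler extends WebCrawler {

        // Requirement 4: only follow URIs inside this website.
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().toLowerCase().startsWith("https://www.example.com/");
        }

        // Requirements 2, 3 and 5: crawler4j calls visit() for every fetched
        // page, queues its outgoing links, and skips URIs it has already seen.
        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data
            config.setMaxPagesToFetch(50);                  // requirement 7

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://www.example.com/"); // requirement 1
            controller.start(SiteCrawler.class, 4);         // requirement 6: 4 threads
        }
    }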

This is a simple scenario. Try solving this with Jsoup: all of this functionality would have to be implemented by you. Crawler4J, or any crawler micro-framework for that matter, would or should have an implementation for the actions above. Jsoup's strong qualities shine when you get to decide what to do with the content.

Let's take a look at some requirements for parsing (a Jsoup sketch follows the list).

  1. Get all paragraphs of a page.
  2. Get all images.
  3. Remove invalid tags (tags that do not comply with the HTML specs).
  4. Remove script tags.
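A minimal sketch of those tasks with Jsoup (the HTML string here is a stand-in for whatever you fetched; Safelist was named Whitelist in Jsoup releases before 1.14):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Safelist;
    import org.jsoup.select.Elements;

    public class ParsingExample {
        public static void main(String[] args) {
            String html = "<html><body><p>Hello</p><img src='logo.png'>"
                        + "<script>alert('hi')</script></body></html>";

            Document doc = Jsoup.parse(html);

            Elements paragraphs = doc.select("p");    // requirement 1
            Elements images = doc.select("img[src]"); // requirement 2
            doc.select("script").remove();            // requirement 4

            // Requirement 3: keep only tags from an allow-list of valid HTML;
            // the parser itself also normalizes malformed markup on parse.
            String cleaned = Jsoup.clean(html, Safelist.relaxed());

            System.out.println(paragraphs.size() + " paragraphs, "
                    + images.size() + " images");
            System.out.println(cleaned);
        }
    }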

This is where Jsoup comes into play. Of course, there is some overlap here. Some things might be possible with both Crawler4J and Jsoup, but that doesn't make them equivalent. You could remove the mechanism for retrieving content from Jsoup and it would still be an amazing tool to use. If Crawler4J were to remove the retrieval, it would lose half of its functionality.

I used both of them in the same project in a real-life scenario. I crawled a site, leveraging the strong points of Crawler4J, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup, in order to extract the information I needed. Could I have used just one of them? Yes, I could, but I would have had to implement all the missing functionality.
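A sketch of that hand-off inside the crawler's visit() callback (HtmlParseData is crawler4j's container for fetched HTML; the CSS selector is a hypothetical example):

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class ExtractingCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            // Crawler4j has already fetched the page; hand its raw HTML to Jsoup.
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData parseData = (HtmlParseData) page.getParseData();
                Document doc = Jsoup.parse(parseData.getHtml(),
                                           page.getWebURL().getURL());
                // Jsoup's CSS selectors do the fine-grained extraction;
                // "article h2" is just an illustrative selector.
                doc.select("article h2")
                   .forEach(heading -> System.out.println(heading.text()));
            }
        }
    }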

Hence the difference: Crawler4J is a crawler with some simple operations for parsing (you could extract the images in one line), but there is no implementation for complex CSS queries. Jsoup is a parser that gives you a simple API for HTTP requests. For anything more complex, there is no implementation.
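For completeness, that "simple API for HTTP requests" in Jsoup is the connect() helper: a single-URI fetch with no crawling machinery behind it (the URL is a placeholder):

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class FetchOnePage {
        public static void main(String[] args) throws IOException {
            // One page, one call: no link queue, no deduplication, no
            // multithreading. That crawling machinery is what Crawler4j adds.
            Document doc = Jsoup.connect("https://www.example.com/").get();
            System.out.println(doc.title());
        }
    }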
