Crawler4j vs. Jsoup for crawling and parsing pages in Java
I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.

Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is: what is the difference between them? There is a similar question, which is marked as answered:

> Crawler4j is a crawler, Jsoup is a parser.

But I just checked: Jsoup is also capable of crawling a page in addition to its parsing functionality, while Crawler4j is capable not only of crawling a page but also of parsing its content.

So can you please clarify the difference between Crawler4j and Jsoup?
Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, there is no real benefit in using something like Crawler4J.
Let's take a look at an example. Assume you want to crawl a website. The requirements would include:

1. Avoid circular crawling: the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again.
2. Retrieve only 50 URIs, beginning from the Home page.

This is a simple scenario. Try solving it with Jsoup: all of this functionality must be implemented by you. Crawler4J, or any crawler micro-framework for that matter, would or should have an implementation for the actions above.
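To see what that means in practice, here is a minimal sketch (this is not Crawler4j's actual API; the class and names are made up) of just the bookkeeping a crawler framework handles for you: a frontier queue, a visited set to avoid circular crawling, and a page limit. The link-lookup map stands in for the fetch-and-parse step that Jsoup would perform, so the scheduling logic can be shown without any network access.

```java
import java.util.*;

public class CrawlFrontier {
    private final Set<String> visited = new HashSet<>();
    private final Deque<String> frontier = new ArrayDeque<>();
    private final int maxPages;

    public CrawlFrontier(String seed, int maxPages) {
        this.maxPages = maxPages;
        frontier.add(seed);
    }

    // linkExtractor maps a URI to its outgoing same-site links; in a real
    // crawler this would be "fetch the page and parse its links" (Jsoup's job).
    public List<String> crawl(Map<String, List<String>> linkExtractor) {
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty() && order.size() < maxPages) {
            String uri = frontier.poll();
            if (!visited.add(uri)) continue;   // already seen: skip (About -> Home case)
            order.add(uri);
            for (String out : linkExtractor.getOrDefault(uri, List.of())) {
                if (!visited.contains(out)) frontier.add(out);
            }
        }
        return order;
    }
}
```

Even this toy version ignores multithreading, politeness delays, and robots.txt — all things a crawler framework would also cover.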
Jsoup's strong qualities shine when you get to decide what to do with the content.
Let's take a look at some requirements for parsing:

1. Get all the paragraphs of a page.
2. Get all the images.
3. Remove invalid tags (tags that do not comply with the HTML specs).
4. Remove script tags.

This is where Jsoup comes into play. Of course, there is some overlap here. Some things might be possible with both Crawler4J and Jsoup, but that doesn't make them equivalent. You could remove the mechanism of retrieving content from Jsoup, and it would still be an amazing tool to use. If Crawler4J were to remove retrieval, it would lose half of its functionality.
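Those parsing tasks are each about one line with Jsoup's CSS-selector API. A small sketch (assuming jsoup is on the classpath; the HTML string is invented for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {

    // Count the paragraphs and images of a document and drop its script tags.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);            // jsoup also repairs invalid markup while parsing
        int paragraphs = doc.select("p").size();     // get all paragraphs
        int images = doc.select("img[src]").size();  // get all images
        doc.select("script").remove();               // remove script tags
        return paragraphs + " paragraphs, " + images + " image(s)";
    }

    public static void main(String[] args) {
        String html = "<p>First</p><img src='/logo.png'><script>alert('x')</script><p>Second</p>";
        System.out.println(summarize(html));   // 2 paragraphs, 1 image(s)
    }
}
```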
I used both of them in the same project in a real-life scenario. I crawled a site, leveraging the strong points of Crawler4J, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup, in order to extract the information I needed. Could I have used just one of them? Yes, I could, but I would have had to implement all the missing functionality myself.
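The hand-off described here looks roughly like the following (a sketch assuming crawler4j and jsoup on the classpath; the site URL and the CSS query are made-up placeholders, and the `CrawlController` setup that starts the crawl is omitted):

```java
import org.jsoup.Jsoup;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class SiteCrawler extends WebCrawler {

    // Crawler4j's side: decide which URIs to visit (stay inside the site).
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().startsWith("https://www.example.com/");
    }

    // Crawler4j's side: called once per fetched page; hand the HTML to Jsoup.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            System.out.println(extractHeadline(html));
        }
    }

    // Jsoup's side: a CSS query of the kind Crawler4j has no equivalent for.
    static String extractHeadline(String html) {
        return Jsoup.parse(html).select("article > h1").text();
    }
}
```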
Hence the difference: Crawler4J is a crawler with some simple operations for parsing (you can extract the images in one line), but there is no implementation for complex CSS queries. Jsoup is a parser that also gives you a simple API for HTTP requests. For anything more complex, there is no implementation.