
Web Crawler vs HTML Parser

What is the difference between a web crawler and a parser?

In Java there are several libraries for fetching web content. For example, Nutch is called a crawler and jsoup is called a parser.

Do they serve the same purpose?

Are they interchangeable for this job?

thanks

The jsoup library is a Java library for working with real-world HTML. It is capable of fetching and working with HTML. However, it is not a web crawler in general, since it only fetches one page at a time (unless you write a custom program, i.e. a crawler, that uses jsoup to fetch a page, extract its URLs, and then fetch those in turn).
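To make the distinction concrete, here is a minimal sketch (assuming jsoup is on the classpath and using a placeholder URL) of fetching a single page with jsoup and extracting the absolute URLs of its links:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupFetchExample {
        public static void main(String[] args) throws Exception {
            // Fetch and parse one page into a Document (a single request, no crawling)
            Document doc = Jsoup.connect("https://example.com/").get();

            // Extract every link and resolve it to an absolute URL
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }

This is exactly where jsoup stops: it gives you the parsed page and its links, but deciding which of those links to fetch next is up to your own program.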

A web crawler uses an HTML parser to extract URLs from a previously fetched website and adds these newly discovered URLs to its frontier.

A general sequence diagram of a Web crawler can be found in this answer: What sequence of steps does crawler4j follow to fetch data?

To summarize it:

An HTML parser is a necessary component of a web crawler for parsing and extracting URLs from given HTML input. However, an HTML parser alone is not a web crawler, as it lacks necessary features such as keeping track of previously visited URLs, politeness, etc.
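As a rough sketch (not Nutch's or crawler4j's actual implementation, just an illustration with a hypothetical seed URL and an arbitrary page limit), a crawler built on top of jsoup could maintain a frontier queue and a visited set like this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    public class MiniCrawler {
        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new ArrayDeque<>(); // URLs still to be fetched
            Set<String> visited = new HashSet<>();       // URLs already fetched
            frontier.add("https://example.com/");        // hypothetical seed URL

            int maxPages = 10;                           // arbitrary limit for this sketch
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue;                            // already fetched, skip it
                }
                Document doc = Jsoup.connect(url).get(); // fetch and parse one page
                for (Element link : doc.select("a[href]")) {
                    String next = link.attr("abs:href"); // resolve to an absolute URL
                    if (next.startsWith("http") && !visited.contains(next)) {
                        frontier.add(next);              // grow the frontier
                    }
                }
                Thread.sleep(1000);                      // crude politeness delay
            }
        }
    }

A real crawler such as Nutch or crawler4j adds much more on top of this: robots.txt handling, per-host politeness, URL normalization, retry and error handling, and persistent storage of the frontier.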

This is easily answered by looking it up on Wikipedia:

A parser is a software component that takes input data (frequently text) and builds a data structure

https://en.wikipedia.org/wiki/Parsing#Computer_languages

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

https://en.wikipedia.org/wiki/Web_crawler
