
How to design a web crawler in Java?

I'm working on a project that requires a web crawler in Java which can take a user query about a particular news subject, visit different news websites, extract the news content from those pages, and store it in files or a database. I need this in order to produce a summary of the overall stored content. I'm new to this field, so I'd appreciate help from anyone who has experience with how to do it.

Right now I have code that extracts the news content from a single page which I supply manually, but I have no idea how to integrate it into a web crawler so that it extracts content from different pages.

Can anyone give me some good links to tutorials or Java implementations that I can use or adapt to my needs?

Have a look at jsoup (http://jsoup.org/):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Fetch the Wikipedia front page and select the "In the news" headline links
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
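Since the question also asks about storing the extracted content, here is one way that snippet might be extended to write results to a file. This is only a sketch; the HeadlineSaver class name and the headlines.txt filename are arbitrary choices:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.FileWriter;
import java.io.PrintWriter;

public class HeadlineSaver {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
        Elements newsHeadlines = doc.select("#mp-itn b a");
        // Append each headline's text and absolute link to a local file
        try (PrintWriter out = new PrintWriter(new FileWriter("headlines.txt", true))) {
            for (Element headline : newsHeadlines) {
                out.println(headline.text() + "\t" + headline.attr("abs:href"));
            }
        }
    }
}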

One word of advice in addition to the other answers - make sure that your crawler respects robots.txt and does not crawl sites rapidly and indiscriminately, or you are likely to get yourself/your organisation blocked by the sites you want to visit.
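As a rough illustration of what "polite" fetching can look like using only the JDK's HttpClient - the robots.txt handling below is deliberately naive (it only honours Disallow rules under User-agent: *), and the PoliteFetcher name and two-second delay are placeholder choices; for real use, a dedicated robots.txt parser library would be a better idea:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashSet;
import java.util.Set;

public class PoliteFetcher {
    private final HttpClient client = HttpClient.newHttpClient();
    private final Set<String> disallowed = new HashSet<>();
    private long lastFetch = 0;
    private static final long DELAY_MS = 2000; // fixed delay between requests

    // Load Disallow rules for "User-agent: *" from the site's robots.txt (naive parsing)
    public void loadRobotsTxt(String baseUrl) throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/robots.txt")).build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        boolean applies = false;
        for (String line : resp.body().split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                applies = line.substring(11).trim().equals("*");
            } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    // Fetch a page, honouring the Disallow rules and the crawl delay
    public String fetch(String url) throws IOException, InterruptedException {
        String path = URI.create(url).getPath();
        for (String rule : disallowed) {
            if (path != null && path.startsWith(rule)) {
                throw new IOException("Blocked by robots.txt: " + url);
            }
        }
        long wait = lastFetch + DELAY_MS - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastFetch = System.currentTimeMillis();
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}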

There are several open source Java crawler libraries that people commonly recommend.

My personal favourite is Java Web Crawler, for its speed and ease of configuration.

By the way, if it's nothing that big (say, an assignment) and your source websites are NOT changing frequently, I would recommend implementing a simple HTML parser yourself.

Hope it helps.

I'd recommend that you check out my answers here: How can I bring google-like recrawling in my application(web or console) and Designing a web crawler

The first answer was provided for a C# question, but it's actually language-agnostic, so it applies to Java too. Check out the links I've provided in both answers; there is some good reading material. I'd also suggest trying one of the already existing Java crawlers rather than writing one yourself (it's not a small project).

...a web crawler in Java which can take a user query about a particular news subject, then visit different news websites, extract news content from those pages, and store it in some files/databases.

That requirement seems to go beyond the scope of "just a crawler" and into the territory of machine learning and natural language processing. If you have a list of websites that you're sure serve news, then you might be able to extract the news content. However, even then you have to determine which part of the page is news and which isn't (i.e., there might also be links, ads, comments, etc.). So exactly what kind of requirements are you facing here? Do you have a list of news websites? Do you have a reliable way to extract the news?
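To illustrate the "known list of sites" case: if you know each site's markup, per-site CSS selectors can separate the article from the surrounding ads and comments. A sketch with jsoup, where the URL and all of the selectors are hypothetical and would differ for every site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ArticleExtractor {
    public static void main(String[] args) throws Exception {
        // Hypothetical news page; the selectors below depend entirely on the site's markup
        Document doc = Jsoup.connect("http://example.com/news/some-story").get();
        String headline = doc.select("h1").text();
        // Drop obvious non-news elements before reading the body text
        doc.select("aside, .ad, .comments, nav").remove();
        String body = doc.select("article").text();
        System.out.println(headline + "\n\n" + body);
    }
}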

I found this article to be really helpful when I was reading about web crawlers.

It provides a step-by-step guide to developing a multi-threaded crawler.

In essence, the following is a very high-level view of what a crawler should do (see the sketch after the list):

- Insert first URL in the queue
- Loop until enough documents are gathered:
   - Get first URL from the queue and save the document
   - Extract links from the saved document and insert them in the queue
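A minimal single-threaded sketch of that loop, using jsoup for fetching and link extraction. The seed URL and the 100-document limit are arbitrary placeholders; a real crawler would add the politeness delays, robots.txt checks, and error handling discussed above:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();  // avoid fetching the same URL twice
        queue.add("http://en.wikipedia.org/");  // 1. insert the first URL into the queue

        while (!queue.isEmpty() && visited.size() < 100) { // 2. loop until enough documents
            String url = queue.poll();
            if (!visited.add(url)) continue;
            try {
                Document doc = Jsoup.connect(url).get();    // get the URL and save the document
                // ... store doc.text() in a file or database here ...
                for (Element link : doc.select("a[href]")) { // extract links and enqueue them
                    String href = link.attr("abs:href");     // abs:href resolves relative URLs
                    if (href.startsWith("http")) queue.add(href);
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }
}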
