
Creating a Web Crawler in Java EE

I am creating a web crawler using Java EE technologies. I have created a crawler service that stores the results of the crawl as CrawlerElement objects, each of which contains the information of interest to me.

Currently I am using the jsoup library to do this, but it is not reliable: I attempt the connection three times with a timeout of 10 seconds, and it is still unreliable.

By unreliable I mean that even pages which can be accessed publicly cannot be accessed by the crawler program. I know this could be due to robots.txt exclusion, but the pages are allowed there too, and it is still unreliable.
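
In case it helps to see it concretely, here is a minimal sketch of the retry logic I described above (three attempts, 10-second timeout); the user agent and class name are placeholders:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetcher {

    // Try to fetch a page up to maxAttempts times, 10-second timeout per attempt.
    public static Document fetchWithRetries(String url, int maxAttempts) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; MyCrawler/1.0)") // placeholder user agent
                        .timeout(10000) // 10 seconds, in milliseconds
                        .get();
            } catch (IOException e) {
                lastFailure = e; // remember the failure and retry
            }
        }
        throw lastFailure; // all attempts failed
    }
}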

So I decided to go with a URLConnection object, which has an openConnection and then a connect method for doing this.

I have one more requirement that is bugging me: I have to get the response time in milliseconds for a CrawlerElement, i.e. how long it took to load page B from page A. I checked the methods of URLConnection and there is no built-in way to do that.

Any ideas on this topic? Can anyone help me?

I was thinking of writing code that takes the current time in milliseconds before the content-fetching code and again after it, subtracts the two, and saves the difference in the database, but I was wondering whether that would be accurate.
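
Something like the following sketch is what I have in mind, assuming the page is fetched with a plain URLConnection. The body is drained fully so the measurement covers the whole download, and System.nanoTime is used because it is a monotonic clock, unlike System.currentTimeMillis:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class ResponseTimer {

    // Fetch the URL and return the elapsed time in milliseconds,
    // measured from just before connecting until the body is fully read.
    public static long timedFetch(String pageUrl) throws IOException {
        long start = System.nanoTime(); // monotonic, unaffected by system clock changes
        URLConnection connection = new URL(pageUrl).openConnection();
        connection.connect();
        InputStream in = connection.getInputStream();
        try {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // drain the stream so the measurement includes the full download
            }
        } finally {
            in.close();
        }
        return (System.nanoTime() - start) / 1000000L; // nanoseconds to milliseconds
    }
}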

Thanks in advance.

EDIT: CURRENT IMPLEMENTATION

Here is my current implementation, which gives me the status code, content type, etc.:

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class GetContent {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.javacoffeebreak.com/faq/faq0079.html");
        long startTime = System.currentTimeMillis();
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("Authorization", "Basic bG9hbnNkZXY6bG9AbnNkM3Y=");
        uc.setRequestProperty("User-Agent", "");
        uc.connect();
        // Note: this only times the connection setup, not the download of the body.
        long endTime = System.currentTimeMillis();
        System.out.println(endTime - startTime);
        String contentType = uc.getContentType();
        System.out.println(contentType);
        // Header field 0 is the status line, e.g. "HTTP/1.1 200 OK".
        String statusCode = uc.getHeaderField(0);
        System.out.println(statusCode);
    }
}

What do you say, is it okay to do it this way, or should I use a heavier API like Apache HttpClient or Apache Nutch?

It's better to use proven frameworks than to reinvent the wheel. Try Apache Nutch (I recommend the 1.x branch; 2.x seems too raw). It will be a lot of pain to implement your own crawler with support for parallelism, robots.txt / the "noindex" meta tag, redirects, reliability... There are so many issues to solve.

OK, so you have done the work and are running into problems with that API/library. I know it is terrifying to build one thing, then throw away all that code and shift to another, but jsoup is just a parser library and may cause more problems for you in the future, so I suggest you use a more stable API. You can also use crawler4j for that purpose.
There are lists of open-source crawler APIs out there; with some R&D you can find a good solution for this :)
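
To give a rough idea of what crawler4j looks like, here is a sketch based on its 4.x API (the seed URL, domain filter, storage folder, and thread count are all placeholders):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Restrict the crawl to one site; placeholder domain.
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called for every fetched page; you could build a CrawlerElement here.
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder folder
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/"); // placeholder seed URL
        controller.start(MyCrawler.class, 2); // 2 crawler threads
    }
}

Note that crawler4j handles robots.txt, redirects, and parallelism for you, which covers most of the reliability issues you described.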

Try the Apache HttpClient library. I've had good results with it. It seems to be quite a bit better for HTTP-specific communication.
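
For instance, here is a minimal sketch with HttpClient 4.x (the URL and timeout values are placeholders), which exposes the status code directly instead of going through header field 0:

import java.io.IOException;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws IOException {
        // Connection and socket timeouts of 10 seconds each (placeholders).
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .build();
        CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
        try {
            HttpGet get = new HttpGet("http://www.javacoffeebreak.com/faq/faq0079.html");
            CloseableHttpResponse response = client.execute(get);
            try {
                System.out.println(response.getStatusLine().getStatusCode()); // e.g. 200
                System.out.println(response.getEntity().getContentType());    // Content-Type header
                String body = EntityUtils.toString(response.getEntity());     // page HTML
                System.out.println(body.length());
            } finally {
                response.close();
            }
        } finally {
            client.close();
        }
    }
}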
