
Get HTTP status using crawler4j & Jsoup

I am creating a Groovy & Grails app with MongoDB as the backend. I am using crawler4j for crawling and Jsoup for parsing. I need to get the HTTP status of a URL and save it to the database. I am trying the following:

@Override
void visit(Page page) {
    try {
        String url = page.getWebURL().getURL()   // the URL being visited
        Document doc = Jsoup.connect(url).get()
        Connection.Response response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute()
        int statusCode = response.statusCode()
        println "statuscode is " + statusCode
        if (statusCode == 200)
            urlExists = true    // urlExists is a boolean variable
        else
            urlExists = false
        // save to database
        resource = new Resource(mimeType: "text/html", URLExists: urlExists)
        if (!resource.save(flush: true, failOnError: true)) {
            resource.errors.each { println it }
        }
        // other code
    } catch (Exception e) {
        log.error "Exception is ${e.message}"
    }
}
@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()
        } else {
            println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}

The problem is that as soon as the crawler hits a URL with an HTTP status other than 200 (OK), it goes directly to the handlePageStatusCode() method (because of crawler4j's built-in behaviour) and prints the non-success message, but nothing gets saved to the database. Is there any way I can save to the database when the page status is not 200? If I am doing something wrong, please tell me. Thanks.

Why don't you save it to the database when the crawler goes down to handlePageStatusCode()?

@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()

            // save to database

        } else {
            println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}

The crawler will then move on to the next link, and you can do the same thing for each one.
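For example, the "save to database" step could look something like the following. This is only a minimal sketch reusing the Resource domain class from the question; the url and statusCode properties on Resource are assumptions for illustration, so adjust them to your actual schema.

    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        if (statusCode != HttpStatus.SC_OK) {
            // visit() is never called for this page, so persist the result here.
            // 'url' and 'statusCode' as Resource properties are assumptions for
            // illustration; the question only shows mimeType and URLExists.
            Resource resource = new Resource(
                    mimeType: "text/html",
                    URLExists: false,
                    url: webUrl.getURL(),
                    statusCode: statusCode)
            if (!resource.save(flush: true)) {
                resource.errors.each { println it }
            }
        }
    }

If the crawler threads run outside a Hibernate session, wrapping the save in Resource.withTransaction { ... } (or Resource.withNewSession { ... }) avoids "no session" errors.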

Or you can save it beforehand, in visit():

if (statusCode == 200)
    urlExists = true    // urlExists is a boolean variable
else {
    // save to database
    urlExists = false
}

EDIT:

Alternatively, add webUrl.getURL() to an ArrayList as each non-200 status comes in, then save the whole list to the database at the end of the crawl, as sketched below.
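A minimal sketch of that approach, assuming your crawler4j version exposes the onBeforeExit() hook on WebCrawler; StatusCollectingCrawler, brokenUrls, and the url property on Resource are illustrative names, not from the question:

    import edu.uci.ics.crawler4j.crawler.WebCrawler
    import edu.uci.ics.crawler4j.url.WebURL
    import org.apache.http.HttpStatus

    class StatusCollectingCrawler extends WebCrawler {

        // Each crawler thread gets its own instance, so a plain list is safe here.
        private final List<String> brokenUrls = new ArrayList<>()

        @Override
        protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
            if (statusCode != HttpStatus.SC_OK) {
                brokenUrls.add(webUrl.getURL())
            }
        }

        @Override
        void onBeforeExit() {
            // Persist everything in one pass when this crawler instance shuts down.
            brokenUrls.each { String url ->
                // The 'url' property on Resource is an assumption; the question only
                // shows mimeType and URLExists, so adjust to your own domain class.
                new Resource(mimeType: "text/html", URLExists: false, url: url).save(flush: true)
            }
        }
    }

Batching the writes this way keeps database access out of the per-page hot path, at the cost of losing the URLs if the crawler dies before it exits cleanly.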
