
How to implement a Java crawler to crawl for PDF-file links?

Task details: Java web PDF crawler. Tool: Eclipse.

I want to get .pdf links as output. How can I do that in Java? After crawling http://namastenepal.de , the output should be: http://namastenepal.de/menu_namaste_nepal_chemnitz_vegan_vegetarisch.pdf

The crawler described at http://www.netinstructions.com/how-to-make-a-simple-web-crawler-in-java/ outputs all HTML links (href attributes). I want the same kind of output, but for file links.

Kindly give me suggestions.

Thanks

You can use crawler4j (see https://github.com/yasserg/crawler4j ) and adjust the shouldVisit(...) and visit(...) methods of your WebCrawler subclass for your use case.

For your specific example, it would look something like this:

 @Override
 public boolean shouldVisit(Page referringPage, WebURL url) {
     String href = url.getURL().toLowerCase();
     //only visit pages from namastenepal.de
     return href.startsWith("http://namastenepal.de");
 }

and

 @Override
 public void visit(Page page) {
     String url = page.getWebURL().getURL();

     //only process urls ending with .pdf after visiting them...
     if (url.endsWith(".pdf")) {
         //do something
     }
 }
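Since the question asks for the .pdf link itself as output, the //do something branch could simply print the URL. A minimal sketch (the println call is just one possible choice, not part of the original answer):

     if (url.endsWith(".pdf")) {
         //print the link, matching the expected output from the question
         System.out.println(url);
     }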

Note that you cannot restrict shouldVisit(...) to .pdf URLs only, because the crawler needs to traverse the rest of the website to discover the .pdf links. For this reason, it must also allow non-.pdf links.
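For completeness, here is a minimal sketch of the controller setup that ties this together, following the crawler4j README. The storage folder path and the crawler class name (PdfCrawler) are placeholders you would adapt; the setIncludeBinaryContentInCrawling(true) call is there so that .pdf responses are not skipped as unparseable binary content.

 import edu.uci.ics.crawler4j.crawler.CrawlConfig;
 import edu.uci.ics.crawler4j.crawler.CrawlController;
 import edu.uci.ics.crawler4j.fetcher.PageFetcher;
 import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
 import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

 public class PdfCrawlController {
     public static void main(String[] args) throws Exception {
         CrawlConfig config = new CrawlConfig();
         config.setCrawlStorageFolder("/tmp/crawler4j");   //placeholder storage folder
         config.setIncludeBinaryContentInCrawling(true);   //fetch binary content such as PDFs

         PageFetcher pageFetcher = new PageFetcher(config);
         RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

         //seed with the site from the question
         controller.addSeed("http://namastenepal.de");

         //PdfCrawler is the WebCrawler subclass containing the shouldVisit(...) and
         //visit(...) overrides shown above (the class name is just an example)
         controller.start(PdfCrawler.class, 1);
     }
 }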
