
How to implement a Java crawler to crawl for PDF-file links?

Task details: Java web PDF crawler. Tool: Eclipse.

I want to get .pdf links as output. How can I do that in Java? After crawling http://namastenepal.de , the output should be: http://namastenepal.de/menu_namaste_nepal_chemnitz_vegan_vegetarisch.pdf

The crawler described at http://www.netinstructions.com/how-to-make-a-simple-web-crawler-in-java/ outputs all HTML links (href attributes). I want the same kind of output, but for file links.

Kindly give me suggestions.

Thanks

You can use crawler4j (see https://github.com/yasserg/crawler4j ) and adjust the shouldVisit(...) and visit(...) methods of your WebCrawler subclass for your use case.

For your specific example, it would look something like this:

 @Override
 public boolean shouldVisit(Page referringPage, WebURL url) {
     String href = url.getURL().toLowerCase();
     //only visit pages from namastenepal.de
     return href.startsWith("http://namastenepal.de");
 }

and

 @Override
 public void visit(Page page) {
     String url = page.getWebURL().getURL();

     //only process urls ending with .pdf after visiting them...
     if (url.endsWith(".pdf")) {
         //do something
     }
 }
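Since the question asks for the .pdf link itself as output, the //do something branch could simply print the URL. A minimal sketch (the println call is just one possible choice, not part of the original answer):

     if (url.endsWith(".pdf")) {
         //print the link, matching the expected output from the question
         System.out.println(url);
     }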

Note that you cannot restrict shouldVisit(...) to .pdf URLs only, because the crawler needs to traverse the rest of the website to discover the .pdf links. For this reason, it must also allow non-.pdf links.
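For completeness, here is a minimal sketch of the controller setup that ties this together, following the crawler4j README. The storage folder path and the crawler class name (PdfCrawler) are placeholders you would adapt; the setIncludeBinaryContentInCrawling(true) call is there so that .pdf responses are not skipped as unparseable binary content.

 import edu.uci.ics.crawler4j.crawler.CrawlConfig;
 import edu.uci.ics.crawler4j.crawler.CrawlController;
 import edu.uci.ics.crawler4j.fetcher.PageFetcher;
 import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
 import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

 public class PdfCrawlController {
     public static void main(String[] args) throws Exception {
         CrawlConfig config = new CrawlConfig();
         config.setCrawlStorageFolder("/tmp/crawler4j");   //placeholder storage folder
         config.setIncludeBinaryContentInCrawling(true);   //fetch binary content such as PDFs

         PageFetcher pageFetcher = new PageFetcher(config);
         RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

         //seed with the site from the question
         controller.addSeed("http://namastenepal.de");

         //PdfCrawler is the WebCrawler subclass containing the shouldVisit(...) and
         //visit(...) overrides shown above (the class name is just an example)
         controller.start(PdfCrawler.class, 1);
     }
 }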
