Task details: Java web PDF crawler tool (Eclipse)
I want a crawler written in Java that outputs .pdf links. For example, after crawling http://namastenepal.de, the output should include:
http://namastenepal.de/menu_namaste_nepal_chemnitz_vegan_vegetarisch.pdf
The crawler described at http://www.netinstructions.com/how-to-make-a-simple-web-crawler-in-java/ prints all HTML links (href attributes); I want the same kind of output, but for file links.
Kindly give me suggestions.
Thanks
You can use crawler4j (see https://github.com/yasserg/crawler4j) and override the shouldVisit(...) and visit(...) methods in your WebCrawler subclass for your use case. For your given example, it would be something like:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // only visit pages from namastenepal.de
    return href.startsWith("http://namastenepal.de");
}
and
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    // only process URLs ending with .pdf after visiting them...
    if (url.endsWith(".pdf")) {
        // do something
    }
}
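One caveat with the endsWith(".pdf") check: it is case-sensitive and fails on URLs with a query string. A small self-contained helper (the class and method names here are illustrative, not part of crawler4j) that handles both cases might look like this:

```java
public class PdfLinkFilter {
    // Returns true when the URL path ends with ".pdf", ignoring case
    // and any trailing query string (e.g. "...menu.pdf?download=1").
    static boolean isPdfLink(String url) {
        String path = url;
        int q = path.indexOf('?');
        if (q >= 0) {
            path = path.substring(0, q); // strip the query string
        }
        return path.toLowerCase().endsWith(".pdf");
    }

    public static void main(String[] args) {
        System.out.println(isPdfLink(
            "http://namastenepal.de/menu_namaste_nepal_chemnitz_vegan_vegetarisch.pdf"));
        System.out.println(isPdfLink("http://namastenepal.de/index.html"));
    }
}
```

Inside visit(...), you would call this helper on page.getWebURL().getURL() instead of the bare endsWith check.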
Note that you cannot restrict shouldVisit(...) to .pdf URLs only: the crawler needs to traverse the site's HTML pages to discover the .pdf links in the first place, so non-.pdf URLs have to pass the filter as well.
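To actually run the crawler, you also need a small driver that configures a CrawlController, seeds it with the start URL, and launches your WebCrawler subclass. A minimal sketch follows, assuming your subclass is called PdfCrawler (a name chosen here for illustration) and that crawler4j is on the classpath; the storage folder path is also just an example.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class PdfCrawlerDriver {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        // folder where crawler4j stores intermediate crawl data (example path)
        config.setCrawlStorageFolder("/tmp/pdfcrawl");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer =
            new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller =
            new CrawlController(config, pageFetcher, robotstxtServer);

        // seed the crawl with the site from the question
        controller.addSeed("http://namastenepal.de");

        // start one crawler instance of your WebCrawler subclass
        controller.start(PdfCrawler.class, 1);
    }
}
```

With the two overridden methods above in PdfCrawler, every URL ending in .pdf that the crawler encounters on the site will reach the "do something" branch, where you can print or collect it.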