Crawling PDFs with crawler4j

Question

I am currently using crawler4j to crawl a website and return each page's URL along with the URL of its parent page. I am using the basic crawler, which works fine except that it does not return the PDFs. I know it is crawling the PDFs because I checked what it crawls before the filter is applied, and the PDFs show up. The PDFs seem to disappear or get skipped when it enters

public void visit(Page page) {

I have no clue why it is doing this. Can anyone help me with this? It would be greatly appreciated! Thanks.

Answer 1

This is extremely timely: I am working on the same problem today and ran into the exact same issue. I'm returning true in shouldVisit for PDF URLs, but, like you, I wasn't seeing them show up in visit(Page page). I traced the cause to the CrawlConfig:

config.setIncludeBinaryContentInCrawling(true);

Setting that to true causes the PDFs to show up in the visit method. It looks like reading the binary data then has to be done on the implementor's side, with Apache PDFBox, Apache Tika, or some other PDF library. Hope this helps.
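
For illustration, here is a rough sketch of the two checks involved on the implementor's side: letting PDF URLs through the extension filter, and recognising PDF bytes once they arrive. It uses only the JDK so it can be run on its own. The class name PdfCrawlHelper, its method names, and the list of skipped extensions are all made up for this example and are not part of crawler4j.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of crawler4j.
final class PdfCrawlHelper {

    // Extensions to skip (an assumed list). "pdf" is deliberately absent,
    // so PDF links are not filtered out.
    private static final Pattern SKIPPED = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz)$");

    // Every PDF file starts with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfCrawlHelper() {
    }

    // Returns true unless the lower-cased URL ends in a skipped extension.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !SKIPPED.matcher(href).matches();
    }

    // Returns true if the content begins with the PDF magic bytes.
    static boolean looksLikePdf(byte[] content) {
        if (content == null || content.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (content[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In a crawler4j crawler, the first check would presumably sit in your shouldVisit override, and the second could be applied to the bytes from page.getContentData() inside visit(Page page) before handing them to PDFBox or Tika. Both only matter once the config flag above is set to true.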
