Crawling PDFs with Crawler4j

Original post: 2014-08-13 16:44:15. Tags: html, url, pdf, web-crawler, crawler4j

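I am currently using crawler4j to crawl a website and return each page's URL and its parent page's URL. The basic crawler I am using works fine, except that it does not return PDFs. I know it crawls the PDFs, because I checked what it crawled before I added the filters, and the PDFs showed up. The PDFs seem to disappear, or get skipped, when they reach

public void visit(Page page) {

I don't know why it does this. Can anyone help me with this? It would be greatly appreciated. Thanks!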

1 Answer

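This is very timely: I am working on the same thing and ran into exactly the same problem. I was returning true for PDF URLs in shouldVisit, but, like you, I never saw them show up in visit(Page page). I traced through the source to CrawlConfig:

config.setIncludeBinaryContentInCrawling(true);

Setting this to true causes PDFs to show up in the visit method. It looks like reading the binary data is left to the implementer, though, using Apache PDFBox or Apache Tika (or some other PDF library). Hope this helps.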
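As a rough illustration of that last point, here is a minimal sketch, in plain Java with no crawler4j dependency, of how a visit implementation might recognise a PDF payload before handing it to a PDF library. The class name PdfSniffer and its methods are hypothetical; they are not part of crawler4j, PDFBox, or Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/**
 * Hypothetical helper (not part of crawler4j): decides whether a fetched
 * payload looks like a PDF, so the caller can route it to a PDF library.
 */
final class PdfSniffer {

    // A well-formed PDF begins with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    /** True if the Content-Type value names application/pdf (parameters ignored). */
    static boolean hasPdfContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop any parameters such as "; charset=..." before comparing.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }

    /** True if the payload starts with the "%PDF-" marker. */
    static boolean hasPdfMagic(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if either the declared type or the leading bytes indicate a PDF. */
    static boolean looksLikePdf(String contentType, byte[] data) {
        return hasPdfContentType(contentType) || hasPdfMagic(data);
    }
}
```

Inside visit(Page page), a call such as PdfSniffer.looksLikePdf(page.getContentType(), page.getContentData()) could gate the hand-off to PDFBox or Tika. This assumes the Page accessors getContentType() and getContentData(), so check them against the crawler4j version in use.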

Related questions

Get link text of links when crawling a website using crawler4j

Determining parameters on crawler4j

How to extract all links on a page using crawler4j?

Getting all iframes,base64 codes which are present in html pages using crawler4j

Problem with Crawling Mechanism of Twitter Python Crawler

How to exclude particular DIV of HTMl with id/class, Header and footer sections while crawling web pages using storm-crawler?

Removing html tags when crawling wikipedia with python's urllib2 and Beautifulsoup

dynamic PDF v/s HTML to PDF

Exporting a div's content to PDF

List PDF's From Directory

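Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original post's URL. For any questions, contact: yoyou2525@163.com.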
