
Get all PDF file URLs with Nutch2

Original post: 2018-03-02 14:06:48. Tags: mongodb, apache, web-crawler, nutch

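I am using Nutch 2.3.1 with MongoDB for persistence. My goal is to extract file URLs without downloading the files themselves.

Right now the files are being downloaded. How can I disable the download and keep only the URLs in the database?

How can I extract all crawled URLs from Nutch2?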

1 Answer

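Depending on what you are trying to accomplish, this may need some modification:

If you don't need to parse or extract text from the PDF files, you can set http.content.limit to a low value. Nutch will then stop downloading a file once it reaches the number of bytes you specify. It will still discover the file's URL, but it will only download a fragment of that size.

Of course, this limit also applies to every other URL you fetch, not just the PDFs.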
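As a rough illustration of the setting described above, the following Python sketch adds or updates the http.content.limit property in a Hadoop-style configuration document such as Nutch's conf/nutch-site.xml. The sample XML, the helper name set_property, and the value 1024 are assumptions made for the example and do not come from the original answer. In practice you would most likely edit the file by hand; the script mainly shows the shape of the property entry.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for conf/nutch-site.xml; a real file will hold more properties.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-crawler</value>
  </property>
</configuration>
"""


def set_property(site_xml: str, name: str, value: str) -> str:
    """Return site_xml with property `name` set to `value`, adding it if absent."""
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: overwrite (or create) its <value>.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not found: append a new <property> entry.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# 1024 bytes is an arbitrary illustrative limit, not a recommended value.
updated = set_property(SAMPLE_SITE_XML, "http.content.limit", "1024")
print(updated)
```

Because the limit is global, a value small enough to avoid pulling whole PDFs will also truncate ordinary HTML pages, which can hide outlinks that appear late in a page.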

One workable approach is to write your own protocol plugin that refuses to download any PDF file.
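The decision such a plugin would have to make can be sketched in a few lines. This is not a Nutch plugin: a real one would be written in Java against Nutch's plugin interfaces. The function names looks_like_pdf and plan_fetch are hypothetical, and the check relies only on the URL path ending in .pdf, so it would miss PDFs served from URLs without that suffix.

```python
from urllib.parse import urlsplit


def looks_like_pdf(url: str) -> bool:
    """Guess from the URL path alone whether the target is a PDF."""
    # urlsplit separates the path from the query string and fragment,
    # so "report.pdf?download=1" is still recognised.
    path = urlsplit(url).path
    return path.lower().endswith(".pdf")


def plan_fetch(urls):
    """Split URLs into those to fetch and those to record without downloading."""
    to_fetch, record_only = [], []
    for url in urls:
        (record_only if looks_like_pdf(url) else to_fetch).append(url)
    return to_fetch, record_only


to_fetch, record_only = plan_fetch([
    "http://example.com/index.html",
    "http://example.com/docs/report.PDF?download=1",
    "http://example.com/pdf/overview.html",
])
print("fetch:", to_fetch)
print("record only:", record_only)
```

A stricter variant could inspect the Content-Type response header instead of the file extension, at the cost of one request per URL.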


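Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce this content, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.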
