
Get all PDF file URLs with Nutch 2

I am using Nutch 2.3.1 with MongoDB for persistence. My goal is to extract the file URLs without downloading the files themselves.

Right now it is downloading the files. How can I disable the download and persist only the URLs in the database?

How could I extract all crawled URLs from Nutch 2?

Depending on what you want to accomplish, this may require some modifications:

If you don't want to parse/extract text from the PDF files, you could set a low value for http.content.limit. This essentially prevents Nutch from downloading more than the number of bytes you specify there, but it will still be able to discover the URLs of the files; it just downloads a fragment (the number of bytes you specify) of each one.

Of course, this will also affect the rest of the URLs that you want to fetch/download.
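As a rough illustration of that first approach, the property below could go in conf/nutch-site.xml. http.content.limit is a standard Nutch property (default 65536 bytes; -1 means unlimited); the 1024 here is just an example value, not a recommendation:

```xml
<!-- Sketch for conf/nutch-site.xml: cap fetched content at 1 KB.
     1024 is an illustrative value; the Nutch default is 65536 bytes,
     and -1 means no limit. Note this cap applies to ALL fetched URLs,
     not just PDFs. -->
<property>
  <name>http.content.limit</name>
  <value>1024</value>
  <description>Maximum number of bytes to download per document.</description>
</property>
```

Keep in mind that truncating HTML pages this aggressively can also cut off their outlinks, which is exactly the side effect mentioned above, so the value needs to balance URL discovery against download size.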

One way to go could be to write your own protocol plugin that prevents Nutch from downloading any PDF file.
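For the second question, extracting the already-crawled URLs: a hedged sketch, assuming Nutch 2.x with the default gora-mongodb mapping, where pages are stored in a `<crawlId>_webpage` collection with the reversed URL as `_id` and the plain URL in the `baseUrl` field (the collection name `mycrawl_webpage` below is hypothetical):

```javascript
// Hedged sketch for the mongo shell, assuming Nutch 2.x with the default
// gora-mongodb mapping: pages live in a collection named
// "<crawlId>_webpage" ("mycrawl_webpage" is a hypothetical name),
// the reversed URL is the _id, and the plain URL is in "baseUrl".

// All crawled URLs:
db.getCollection("mycrawl_webpage").find({}, { baseUrl: 1 });

// Only URLs that look like PDF files:
db.getCollection("mycrawl_webpage").find({ baseUrl: /\.pdf$/i }, { baseUrl: 1 });
```

Alternatively, staying inside Nutch itself, `bin/nutch readdb -dump <out_dir>` (optionally with `-crawlId <id>`) dumps the webpage table to text, from which the URLs can be extracted.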
