
Get all PDF file URLs with Nutch 2

I am using Nutch 2.3.1 with MongoDB for persistence. My goal is to extract the file URLs without downloading the files themselves.

Right now it is downloading the files. How can I disable the download and persist only the URLs in the database?

How could I extract all crawled URLs from Nutch 2?

Depending on what you want to accomplish, this may require some modifications:

If you don't want to parse/extract text from the PDF files, you could set a low value for http.content.limit. This essentially prevents Nutch from downloading more than the number of bytes you specify there, but it will still be able to discover the URLs of the files; it just downloads a fragment (the number of bytes you specify) of each one.

Of course, this will also affect the rest of the URLs that you want to fetch/download.
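As a rough illustration of that first approach, the property below could go in conf/nutch-site.xml. http.content.limit is a standard Nutch property (default 65536 bytes; -1 means unlimited); the 1024 here is just an example value, not a recommendation:

```xml
<!-- Sketch for conf/nutch-site.xml: cap fetched content at 1 KB.
     1024 is an illustrative value; the Nutch default is 65536 bytes,
     and -1 means no limit. Note this cap applies to ALL fetched URLs,
     not just PDFs. -->
<property>
  <name>http.content.limit</name>
  <value>1024</value>
  <description>Maximum number of bytes to download per document.</description>
</property>
```

Keep in mind that truncating HTML pages this aggressively can also cut off their outlinks, which is exactly the side effect mentioned above, so the value needs to balance URL discovery against download size.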

One way to go could be to write your own protocol plugin that prevents Nutch from downloading any PDF file.
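For the second question, extracting the already-crawled URLs: a hedged sketch, assuming Nutch 2.x with the default gora-mongodb mapping, where pages are stored in a `<crawlId>_webpage` collection with the reversed URL as `_id` and the plain URL in the `baseUrl` field (the collection name `mycrawl_webpage` below is hypothetical):

```javascript
// Hedged sketch for the mongo shell, assuming Nutch 2.x with the default
// gora-mongodb mapping: pages live in a collection named
// "<crawlId>_webpage" ("mycrawl_webpage" is a hypothetical name),
// the reversed URL is the _id, and the plain URL is in "baseUrl".

// All crawled URLs:
db.getCollection("mycrawl_webpage").find({}, { baseUrl: 1 });

// Only URLs that look like PDF files:
db.getCollection("mycrawl_webpage").find({ baseUrl: /\.pdf$/i }, { baseUrl: 1 });
```

Alternatively, staying inside Nutch itself, `bin/nutch readdb -dump <out_dir>` (optionally with `-crawlId <id>`) dumps the webpage table to text, from which the URLs can be extracted.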
