
Get all PDF file URLs with Nutch 2

I am using Nutch 2.3.1 with MongoDB for persistence. My goal is to extract the file URLs without downloading them.

Right now Nutch downloads the files. How can I disable the download and persist only the URLs in the database?

How can I extract all crawled URLs from Nutch 2?

Depending on what you want to accomplish, this may require some modifications:

If you don't want to parse/extract text from the PDF files, you could set a low value for http.content.limit. This prevents Nutch from downloading more than the number of bytes you specify, while still discovering the URLs of the files; only a fragment (the number of bytes you specify) is downloaded.
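For example, in conf/nutch-site.xml (http.content.limit is a standard Nutch property; the 1024-byte value here is just an illustration, and -1 would mean no limit):

```xml
<property>
  <name>http.content.limit</name>
  <!-- Truncate every fetched document after 1024 bytes. -->
  <value>1024</value>
</property>
```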

Of course, this limit also affects every other URL that you fetch/download.

Another way to go would be to write your own protocol plugin that prevents downloading any PDF file.
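The idea behind such a plugin — recording a PDF's URL without ever pulling its body — can be illustrated outside Nutch with a plain HEAD request. The sketch below is standalone Java (not actual Nutch plugin code, whose interfaces differ); it only transfers response headers, so the PDF body is never downloaded:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PdfProbe {

    // Decide from the Content-Type header alone whether a resource is a PDF.
    static boolean isPdf(String contentType) {
        return contentType != null
                && contentType.toLowerCase().startsWith("application/pdf");
    }

    // Issue a HEAD request so only headers travel over the wire;
    // the body (possibly many megabytes of PDF) is never fetched.
    static boolean isRemotePdf(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest head = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> resp =
                client.send(head, HttpResponse.BodyHandlers.discarding());
        return isPdf(resp.headers().firstValue("Content-Type").orElse(null));
    }

    public static void main(String[] args) {
        System.out.println(isPdf("application/pdf"));          // true
        System.out.println(isPdf("text/html; charset=utf-8")); // false
    }
}
```

A real Nutch protocol plugin would apply the same decision inside its fetch method: if the content type indicates a PDF, return an empty content object (so only the URL and metadata are persisted) instead of the downloaded body.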
