
Get all PDF file URLs with Nutch 2

I am using Nutch 2.3.1 with MongoDB for persistence. My goal is to extract the file URLs without downloading them.

Right now Nutch downloads the files. How can I disable the download and persist only the URLs in the database?

How can I extract all crawled URLs from Nutch 2?

Depending on what you want to accomplish, this may require some modifications:

If you don't want to parse/extract text from the PDF files, you could set a low value for http.content.limit. This prevents Nutch from downloading more than the number of bytes you specify, while still discovering the URLs of the files; only a fragment (the number of bytes you specify) is downloaded.
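For example, in conf/nutch-site.xml (http.content.limit is a standard Nutch property; the 1024-byte value here is just an illustration, and -1 would mean no limit):

```xml
<property>
  <name>http.content.limit</name>
  <!-- Truncate every fetched document after 1024 bytes. -->
  <value>1024</value>
</property>
```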

Of course, this limit also affects every other URL that you fetch/download.

Another way to go would be to write your own protocol plugin that prevents downloading any PDF file.
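The idea behind such a plugin — recording a PDF's URL without ever pulling its body — can be illustrated outside Nutch with a plain HEAD request. The sketch below is standalone Java (not actual Nutch plugin code, whose interfaces differ); it only transfers response headers, so the PDF body is never downloaded:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PdfProbe {

    // Decide from the Content-Type header alone whether a resource is a PDF.
    static boolean isPdf(String contentType) {
        return contentType != null
                && contentType.toLowerCase().startsWith("application/pdf");
    }

    // Issue a HEAD request so only headers travel over the wire;
    // the body (possibly many megabytes of PDF) is never fetched.
    static boolean isRemotePdf(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest head = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> resp =
                client.send(head, HttpResponse.BodyHandlers.discarding());
        return isPdf(resp.headers().firstValue("Content-Type").orElse(null));
    }

    public static void main(String[] args) {
        System.out.println(isPdf("application/pdf"));          // true
        System.out.println(isPdf("text/html; charset=utf-8")); // false
    }
}
```

A real Nutch protocol plugin would apply the same decision inside its fetch method: if the content type indicates a PDF, return an empty content object (so only the URL and metadata are persisted) instead of the downloaded body.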
