Fetching images in StormCrawler without indexing them in status

Original 2021-06-23 04:00:57 1 1 web-crawler/ stormcrawler

I want to download all the images found in web pages and feed them to a machine learning algorithm that classifies them and extracts the objects they contain. I do not want to index them in the status collection. Instead, I want to extract them in the JsoupParser bolt, omit their addresses, download them within the topology, and feed them to a computer vision algorithm. Is this possible in StormCrawler?

1 answer

If you want to fetch them in the topology, they need to be in the status index. They obviously don't need to be in the content index, as there is no text content to query against. You need to write a custom bolt to save the content of the images to whichever form of storage you want. If you run your crawls on EC2, then AWS S3 would be a good fit, for example.

Definitely doable with StormCrawler; in fact, several companies use it for that purpose.
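As a rough illustration of what the storage step in such a custom bolt might do, here is a minimal, dependency-free Java sketch. It is not StormCrawler's API: the class `ImageSink` and its methods `isImage`, `keyFor` and `store` are hypothetical names, and it writes to a local directory rather than S3. It also assumes the bolt receives the URL, the raw bytes and a content type from the fetcher. In a real topology, logic like this would sit inside the bolt's `execute` method, with the file write swapped for a call to your storage client.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper: persists fetched image bytes, skips everything else. */
final class ImageSink {
    private final Path root;

    ImageSink(Path root) {
        this.root = root;
    }

    /** True when the Content-Type value denotes an image (e.g. "image/png"). */
    static boolean isImage(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/");
    }

    /** Derives a stable, filesystem-safe key from the URL (hex SHA-256). */
    static String keyFor(String url) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /**
     * Writes the content under root/<key> if it is a non-empty image.
     * Returns the path written, or null when the document was skipped.
     */
    Path store(String url, String contentType, byte[] content)
            throws IOException, NoSuchAlgorithmException {
        if (!isImage(contentType) || content == null || content.length == 0) {
            return null; // not an image: nothing to save
        }
        Files.createDirectories(root);
        Path target = root.resolve(keyFor(url));
        return Files.write(target, content);
    }
}
```

Hashing the URL keeps the storage key short and free of characters that are awkward in file or object names. The same key could serve as the object name in a bucket.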

Stormcrawler not fetching/indexing pages for elasticsearch

Stormcrawler not indexing content with Elasticsearch

Stormcrawler, the status index and re-crawling

Stormcrawler - how does the es.status.filterQuery work?

Stormcrawler: Injecting new URL to crawl without restarting the topology

StormCrawler settings

StormCrawler maven packaging error

Disable subdomain in flow stormcrawler

How can i crawl page but without fetching video/image content in nutch 2.1?

Is there any limit on redirects in StormCrawler?





solution1 0 2021-06-23 08:38:33
