
Fetching images in StormCrawler without indexing them in status

I want to download all the images found in web pages and feed them to a machine learning algorithm that classifies them and extracts the objects they contain. I do not want to index the images in the status collection. Instead, I want to extract them in the JsoupParser bolt, emit their addresses, download them within the topology, and pass them to a computer vision algorithm. Is this possible in StormCrawler?

If you want to fetch the images in the topology, their URLs need to be in the status index. They don't need to be in the content index, since there is no text content to query against. You will need to write a custom bolt that saves the content of the images to whichever form of storage you want. If you run your crawls on EC2, for example, AWS S3 would be a good fit.

This is definitely doable with StormCrawler; in fact, several companies use it for that purpose.
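To make the "custom bolt that saves the images" idea more concrete, here is a minimal sketch of the storage logic such a bolt could delegate to. It is an illustration under assumptions, not StormCrawler code: `ImageSink`, `isImage`, `fileNameFor` and `store` are hypothetical names, and a local directory stands in for S3 or any other store. The Storm wiring is deliberately left out. In a real topology, the bolt's `execute` method would read the URL, the raw bytes and the content type from the incoming tuple and pass them to something like `store`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: writes fetched image bytes to a directory, one file per URL. */
class ImageSink {
    private final Path baseDir;

    ImageSink(Path baseDir) {
        this.baseDir = baseDir;
    }

    /** True when the Content-Type value denotes an image, e.g. "image/png". */
    static boolean isImage(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/");
    }

    /** Derives a stable, filesystem-safe name from the URL (hex SHA-256). */
    static String fileNameFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Stores the content if it is a non-empty image and returns the file path.
     * Returns an empty Optional for anything else, so HTML pages are skipped.
     */
    Optional<Path> store(String url, String contentType, byte[] content) throws IOException {
        if (!isImage(contentType) || content == null || content.length == 0) {
            return Optional.empty();
        }
        Files.createDirectories(baseDir);
        Path target = baseDir.resolve(fileNameFor(url));
        Files.write(target, content);
        return Optional.of(target);
    }
}
```

Hashing the URL keeps file names unique and free of characters that are awkward in paths or object keys. The same key would work as an S3 object name if the `Files.write` call were replaced with an upload.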


 