
Crawling using StormCrawler

We are trying to use StormCrawler to crawl data. We have been able to find the sub-links from a URL, but we want to get the contents of those sub-links. I have not been able to find many resources that explain how to do this. Any useful links/websites in this regard would be helpful. Thanks.

Getting Started, presentations and talks, as well as the various blog posts should be useful.

If the sublinks are fetched and parsed, which you can check in the logs, then their content is available for indexing or storing, e.g. as WARC files. There is a dummy indexer which dumps the content to the console and can be taken as a starting point (see the sketch below); alternatively, there are resources for indexing the documents in Elasticsearch or SOLR. The WARC module can be used to store the content of the pages as well.
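To make the flow concrete, below is a minimal sketch of a topology along the lines of the one generated by the StormCrawler Maven archetype. It is an assumption-laden starting point rather than a production setup: the class names come from the com.digitalpebble.stormcrawler packages of the 1.x releases, the seed URL is just an example, and the in-memory spout and status updater keep everything in memory, so nothing survives a restart.

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    import com.digitalpebble.stormcrawler.ConfigurableTopology;
    import com.digitalpebble.stormcrawler.Constants;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
    import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
    import com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdaterBolt;
    import com.digitalpebble.stormcrawler.spout.MemorySpout;

    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) throws Exception {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // In-memory spout holding the seed URL(s); example seed only
            builder.setSpout("spout", new MemorySpout("https://stormcrawler.net/"));

            // Partition URLs (by host by default) so politeness applies per site
            builder.setBolt("partitioner", new URLPartitionerBolt())
                    .shuffleGrouping("spout");

            // Fetch the pages over HTTP
            builder.setBolt("fetch", new FetcherBolt())
                    .fieldsGrouping("partitioner", new Fields("key"));

            // Parse the HTML: extracts the text content and the outlinks
            builder.setBolt("parse", new JSoupParserBolt())
                    .localOrShuffleGrouping("fetch");

            // Dummy indexer: dumps the URL, title and text to the console.
            // Replace with the Elasticsearch / SOLR indexing bolt, or add the
            // WARC module's bolt, to actually store the content.
            builder.setBolt("index", new StdOutIndexer())
                    .localOrShuffleGrouping("parse");

            // The outlinks discovered by the parser are emitted on the status
            // stream; this bolt puts them back into the MemorySpout's queue so
            // that the sublinks get fetched and parsed in turn
            Fields furl = new Fields("url");
            builder.setBolt("status", new MemoryStatusUpdaterBolt())
                    .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
                    .fieldsGrouping("parse", Constants.StatusStreamName, furl)
                    .fieldsGrouping("index", Constants.StatusStreamName, furl);

            return submit("crawl", conf, builder);
        }
    }

For a real crawl you would swap the in-memory components for a persistence backend (e.g. the Elasticsearch status index) and replace StdOutIndexer with the indexing or WARC bolt of your choice.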
