
Crawling using Storm Crawler

We are trying to use Storm Crawler to crawl data. We have been able to find the sub-links of a URL, but we also want to get the content of those sub-links. I have not been able to find many resources that explain how to do this. Any useful links or websites in this regard would be helpful. Thanks.

Getting Started, the presentations and talks, as well as the various blog posts should be useful.

If the sub-links are fetched and parsed, which you can check in the logs, then their content is available for indexing or storing, e.g. as WARC. There is a dummy indexer which dumps the content to the console and can be taken as a starting point; alternatively, there are resources for indexing the documents in Elasticsearch or SOLR. The WARC module can be used to store the content of the pages as well.
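To make the wiring concrete, here is a minimal topology sketch in the style of the project's Maven archetype. It assumes the `com.digitalpebble.stormcrawler` package layout of older releases; the seed URL, topology name and component IDs are placeholders, so check the archetype generated for your version for the exact class names and the full set of bolts (e.g. a status updater).

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

/**
 * Minimal crawl topology sketch: fetch a seed URL, parse it (which extracts
 * text and outlinks), and hand the parsed content to a console indexer.
 */
public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URL held in memory; a real crawl would use a status-backed spout.
        builder.setSpout("spout", new MemorySpout("https://example.com/"));

        // Partition URLs (by host by default) so politeness applies per host.
        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
               .fieldsGrouping("partitioner", new Fields("key"));

        // JSoupParserBolt extracts the text, metadata and outlinks of fetched pages.
        builder.setBolt("parse", new JSoupParserBolt())
               .localOrShuffleGrouping("fetch");

        // StdOutIndexer is the dummy indexer that dumps content to the console.
        builder.setBolt("index", new StdOutIndexer())
               .localOrShuffleGrouping("parse");

        return submit("crawl", conf, builder);
    }
}
```

Replacing `StdOutIndexer` with the Elasticsearch or SOLR indexer bolt, or adding the WARC module's bolt after the parser, is what turns the console dump into persisted content for the sub-linked pages.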
