Crawling using StormCrawler
We are trying to use StormCrawler to crawl data. We have been able to find sub-links from a URL, but we want to get the contents of those sub-links. I have not been able to find many resources explaining how to do this. Any useful links/websites in this regard would be helpful.
Thanks.
Getting Started, the presentations and talks, as well as the various blog posts should be useful.
If the sub-links are fetched and parsed - which you can check in the logs - then their content is available for indexing or storing, e.g. as WARC. There is a dummy indexer which dumps the content to the console and can be taken as a starting point; alternatively, there are resources for indexing the documents in Elasticsearch or SOLR.
The WARC module can be used to store the content of pages as well.
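As a sketch of how the console indexer mentioned above can be wired in (assuming a StormCrawler 1.x Flux-based topology; the bolt class name and the `parse` stream id are taken from the default archetype and may differ in your setup), the relevant excerpt of a Flux topology file could look like:

```yaml
# Excerpt of a Flux topology definition (e.g. crawler.flux).
# Routes the parser bolt's output into StdOutIndexer, which simply
# prints the indexed fields to the console - a quick way to verify
# that content from the sub-links is actually being extracted.
bolts:
  - id: "index"
    className: "com.digitalpebble.stormcrawler.indexing.StdOutIndexer"
    parallelism: 1

streams:
  - from: "parse"            # id of the parser bolt defined earlier in the file
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
```

Once this confirms that content is flowing through, the console indexer can be swapped for an Elasticsearch or SOLR indexer bolt, or a WARC bolt can be added alongside it to archive the raw pages.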