
Crawling using Storm Crawler

We are trying to use Storm Crawler to crawl data. We have been able to find the sub-links of a URL, but we also want to get the content of those sub-links. I have not been able to find many resources that explain how to do this. Any useful links or websites in this regard would be helpful. Thanks.

Getting Started, the presentations and talks, as well as the various blog posts should be useful.

If the sub-links are fetched and parsed, which you can check in the logs, then their content is available for indexing or storing, e.g. as WARC. There is a dummy indexer which dumps the content to the console and can be taken as a starting point; alternatively, there are resources for indexing the documents in Elasticsearch or SOLR. The WARC module can be used to store the content of the pages as well.
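To make the wiring concrete, here is a minimal topology sketch in the style of the project's Maven archetype. It assumes the `com.digitalpebble.stormcrawler` package layout of older releases; the seed URL, topology name and component IDs are placeholders, so check the archetype generated for your version for the exact class names and the full set of bolts (e.g. a status updater).

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

/**
 * Minimal crawl topology sketch: fetch a seed URL, parse it (which extracts
 * text and outlinks), and hand the parsed content to a console indexer.
 */
public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URL held in memory; a real crawl would use a status-backed spout.
        builder.setSpout("spout", new MemorySpout("https://example.com/"));

        // Partition URLs (by host by default) so politeness applies per host.
        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
               .fieldsGrouping("partitioner", new Fields("key"));

        // JSoupParserBolt extracts the text, metadata and outlinks of fetched pages.
        builder.setBolt("parse", new JSoupParserBolt())
               .localOrShuffleGrouping("fetch");

        // StdOutIndexer is the dummy indexer that dumps content to the console.
        builder.setBolt("index", new StdOutIndexer())
               .localOrShuffleGrouping("parse");

        return submit("crawl", conf, builder);
    }
}
```

Replacing `StdOutIndexer` with the Elasticsearch or SOLR indexer bolt, or adding the WARC module's bolt after the parser, is what turns the console dump into persisted content for the sub-linked pages.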
