简体繁体 English

使用Storm爬虫对每个域的不同设置（例如速度）进行特定于域的爬网

[英]Domain-specific crawling with different settings for each domain (e.g. speed) using Storm crawler

原文 2017-05-22 17:48:18 8 2 java/ web-crawler/ apache-storm/ stormcrawler

I have discovered Storm crawler only recently and from the past experience and studies and work with different crawlers I find this project based on Apache Storm pretty robust and suitable for many use cases and scenarios. 我最近才发现Storm爬虫，从过去的经验和研究以及与不同的爬虫一起工作，我发现这个基于Apache Storm的项目非常强大，适用于许多用例和场景。

I have read some tutorials and tested the storm crawler with some basic setup. 我已经阅读了一些教程并使用一些基本设置测试了风暴爬虫。 I would like to use the crawler in my project but there are certain things I am not sure if the crawler is capable of doing or even if it is suitable for such use cases. 我想在我的项目中使用爬虫，但有些事情我不确定爬虫是否能够做到，或者即使它适合这种用例。

I would like to do small and large recursive crawls on many web domains with specific speed settings and limit to the number of fetched urls. 我想在许多具有特定速度设置的Web域上进行小型和大型递归爬网，并限制获取的URL的数量。 The crawls can be started separately at any time with different settings (different speed, ignoring robots.txt for that domain, ignoring external links). 可以随时使用不同的设置（不同的速度，忽略该域的robots.txt，忽略外部链接）单独启动爬网。

Questions: 问题：

Is the storm crawler suitable for such scenario? 风暴爬虫是否适合这种情况？
Can I set the limit to the maximum number of pages fetched by the crawler? 我可以将限制设置为爬网程序获取的最大页数吗？
Can I set the limits to the number of fetched pages for different domains? 我可以为不同域的已获取页面数设置限制吗？
Can I monitor the progress of the crawl for specific domains separately? 我可以单独监控特定域的爬网进度吗？
Can I set the settings dynamically without the need of uploading modified topology to storm? 我可以动态设置设置，而无需将修改后的拓扑上传到风暴中吗？
Is it possible to pause or stop crawling (for specific domain)? 是否可以暂停或停止抓取（针对特定域）？
Is usually storm crawler running as one deployed topology? 通常风暴爬虫是作为一个部署的拓扑运行的吗？

I assume that for some of these questions the answer may be in customizing or writing my own bolts or spouts. 我认为对于其中一些问题，答案可能是定制或编写我自己的螺栓或喷口。 But I would rather avoid modifying Fetcher Bolt or main logic of the crawler as that would mean I am developing another crawler. 但我宁愿避免修改Fetcher Bolt或爬虫的主要逻辑，因为这意味着我正在开发另一个爬虫。

Thank you. 谢谢。

2 个解决方案

You have very interesting questions. 你有非常有趣的问题。 I think you can discover more here: the code: https://github.com/DigitalPebble/storm-crawler oficial tutorial: http://stormcrawler.net/ and some responces: http://2015.berlinbuzzwords.de/sites/2015.berlinbuzzwords.de/files/media/documents/julien_nioche-low_latency_scalable_web_crawling_on_apache_storm.pdf 我想你可以在这里发现更多：代码： https ： //github.com/DigitalPebble/storm-crawler oficial tutorial： http ：//stormcrawler.net/，有些回复： http ： //2015.berlinbuzzwords.de/sites /2015.berlinbuzzwords.de/files/media/documents/julien_nioche-low_latency_scalable_web_crawling_on_apache_storm.pdf

Glad you like StormCrawler 很高兴你喜欢StormCrawler

Is the storm crawler suitable for such scenario? 风暴爬虫是否适合这种情况？

Probably but you'd need to modify/customise a few things. 可能你需要修改/定制一些东西。

Can I set the limit to the maximum number of pages fetched by the crawler? 我可以将限制设置为爬网程序获取的最大页数吗？

You can currently set a limit on the depth from the seeds and have a different value per seed. 您现在可以设置种子深度的限制，并为每个种子设置不同的值。

There is no mechanism for filtering globally based on the number of URLs but this could be done. 没有基于URL数量进行全局过滤的机制，但这可以完成。 It depends on what you use to store the URL status and the corresponding spout and status updater implementations. 这取决于您用于存储URL状态的内容以及相应的spout和状态更新程序实现。 For instance, if you were using Elasticsearch for storing the URLs, you could have a URL filter check the number of URLs in the index and filter URLs (existing or not) based on that. 例如，如果您使用Elasticsearch存储URL，则可以使用URL过滤器检查索引中的URL数量，并根据该过滤器过滤URL（现有与否）。

Can I set the limits to the number of fetched pages for different domains? 我可以为不同域的已获取页面数设置限制吗？

You could specialize the solution proposed above and query per domain or host for the number of URLs already known. 您可以专门化上面提出的解决方案，并根据已知的URL数查询每个域或主机。 Doing this would not require any modifications to the core elements, just a custom URL filter. 这样做不需要对核心元素进行任何修改，只需要自定义URL过滤器。

Can I monitor the progress of the crawl for specific domains separately? 我可以单独监控特定域的爬网进度吗？

Again, it depends on what you use as a back end. 同样，这取决于您使用什么作为后端。 With Elasticsearch for instance, you can use Kibana to see the URLs per domain. 例如，使用Elasticsearch，您可以使用Kibana查看每个域的URL。

Can I set the settings dynamically without the need of uploading modified topology to storm? 我可以动态设置设置，而无需将修改后的拓扑上传到风暴中吗？

No. The configuration is read when the worker tasks are started. 否。启动工作任务时将读取配置。 I know of some users who wrote a custom configuration implementation backed by a DB table and got their components to read from that but this meant modifying a lot of code. 我知道一些用户编写了一个由DB表支持的自定义配置实现，并让他们的组件从中读取，但这意味着修改了很多代码。

Is it possible to pause or stop crawling (for specific domain)? 是否可以暂停或停止抓取（针对特定域）？

Not on a per domain basis but you could add an intermediate bolt to check whether a domain should be processed or not. 不是基于每个域，但您可以添加一个中间螺栓来检查域是否应该被处理。 If not you could simply fail the ack. 如果不是，你可能只是失败了。 This depends on the status storage again. 这取决于状态存储再次。 You could also add a custom filter to the ES spouts for instance and a field in the status index. 您还可以为ES spouts添加自定义过滤器，并在状态索引中添加一个字段。 Whenever the crawl should be halted for a specific domain, you could eg modify the value of the field for all the URLs matching a particular domain. 每当针对特定域停止爬网时，您可以例如修改与特定域匹配的所有URL的字段值。

Is usually storm crawler running as one deployed topology? 通常风暴爬虫是作为一个部署的拓扑运行的吗？

Yes, often. 是的，经常。

I assume that for some of these questions the answer may be in customizing or writing my own bolts or spouts. 我认为对于其中一些问题，答案可能是定制或编写我自己的螺栓或喷口。 But I would rather avoid modifying Fetcher Bolt or main logic of the crawler as that would mean I am developing another crawler. 但我宁愿避免修改Fetcher Bolt或爬虫的主要逻辑，因为这意味着我正在开发另一个爬虫。

StormCrawler is very modular so there is always several ways of doing things ;-) StormCrawler是非常模块化的，所以总有几种做事方式;-)

I am pretty sure you could have the behavior you want while having a single topology by modifying small non-core parts. 我很确定通过修改小的非核心部分，您可以拥有单个拓扑所需的行为。 If more essential parts of the code (eg per seed robots settings) are needed, then we'd probably want to add that to the code - you contributions would be very welcome. 如果需要代码的更重要部分（例如，每个种子机器人设置），那么我们可能希望将其添加到代码中 - 您的贡献将非常受欢迎。