
Domain-specific crawling with different settings for each domain (e.g. speed) using StormCrawler

I have discovered StormCrawler only recently, and from past experience, study, and work with different crawlers, I find this project based on Apache Storm pretty robust and suitable for many use cases and scenarios.

I have read some tutorials and tested StormCrawler with a basic setup. I would like to use the crawler in my project, but there are certain things I am not sure the crawler is capable of doing, or whether it is even suitable for such use cases.

I would like to run small and large recursive crawls on many web domains, with specific speed settings and a limit on the number of fetched URLs. The crawls can be started separately at any time with different settings (different speed, ignoring robots.txt for that domain, ignoring external links).

Questions:

  • Is StormCrawler suitable for such a scenario?
  • Can I set a limit on the maximum number of pages fetched by the crawler?
  • Can I set limits on the number of fetched pages for different domains?
  • Can I monitor the progress of the crawl for specific domains separately?
  • Can I change the settings dynamically, without uploading a modified topology to Storm?
  • Is it possible to pause or stop crawling (for a specific domain)?
  • Does StormCrawler usually run as one deployed topology?

I assume that for some of these questions the answer may lie in customizing or writing my own bolts or spouts. But I would rather avoid modifying the Fetcher Bolt or the main logic of the crawler, as that would mean I am developing another crawler.

Thank you.

Glad you like StormCrawler!

  • Is StormCrawler suitable for such a scenario?

Probably, but you'd need to modify/customise a few things.

  • Can I set a limit on the maximum number of pages fetched by the crawler?

You can currently set a limit on the depth from the seeds, and have a different value per seed.
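For reference, depth limiting is handled by StormCrawler's MaxDepthFilter, enabled in the urlfilters.json configuration; the entry below is a sketch from memory, so the class and parameter names are worth double-checking against the version you use. If I recall correctly, a per-seed value can be set by putting a `max.depth` entry in the seed's metadata, which overrides the global parameter.

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
  "name": "MaxDepthFilter",
  "params": {
    "maxDepth": 3
  }
}
```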

There is no mechanism for filtering globally based on the number of URLs, but this could be done. It depends on what you use to store the URL status, and on the corresponding spout and status updater implementations. For instance, if you were using Elasticsearch to store the URLs, you could have a URL filter check the number of URLs in the index and filter URLs (existing or not) based on that.

  • Can I set limits on the number of fetched pages for different domains?

You could specialize the solution proposed above and query per domain or host for the number of URLs already known. Doing this would not require any modifications to the core elements, just a custom URL filter.
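A minimal sketch of the per-domain limit idea, written as a plain class so it runs standalone. In StormCrawler this logic would live in a custom URL filter (the `URLFilter` contract returns the URL to keep it, or null to discard it); the in-memory counter below is a stand-in for a count query against the status index (e.g. Elasticsearch), and the class name and limit are illustrative, not part of StormCrawler.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-domain quota filter: once a domain has produced
// maxPerDomain URLs, further URLs from that domain are discarded.
class PerDomainLimitFilter {

    private final int maxPerDomain;
    // Stand-in for a per-domain count query against the status backend.
    private final Map<String, Integer> seen = new ConcurrentHashMap<>();

    PerDomainLimitFilter(int maxPerDomain) {
        this.maxPerDomain = maxPerDomain;
    }

    /** Returns the URL if its domain is under quota, null to filter it out. */
    String filter(String url) {
        String host = java.net.URI.create(url).getHost();
        int count = seen.merge(host, 1, Integer::sum);
        return count <= maxPerDomain ? url : null;
    }

    public static void main(String[] args) {
        PerDomainLimitFilter filter = new PerDomainLimitFilter(2);
        for (String url : new String[] {
                "http://example.com/a", "http://example.com/b",
                "http://example.com/c", "http://other.org/x" }) {
            System.out.println(url + " -> "
                    + (filter.filter(url) != null ? "kept" : "dropped"));
        }
    }
}
```

In a real filter you would replace the map lookup with a count query per domain/host against whatever stores the URL status.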

  • Can I monitor the progress of the crawl for specific domains separately?

Again, it depends on what you use as a back end. With Elasticsearch, for instance, you can use Kibana to see the URLs per domain.

  • Can I change the settings dynamically, without uploading a modified topology to Storm?

No. The configuration is read when the worker tasks are started. I know of some users who wrote a custom configuration implementation backed by a DB table and got their components to read from that, but this meant modifying a lot of code.

  • Is it possible to pause or stop crawling (for a specific domain)?

Not on a per-domain basis, but you could add an intermediate bolt to check whether a domain should be processed or not; if not, you could simply fail the tuple. This depends on the status storage again. You could also add a custom filter to the ES spouts, for instance, along with a field in the status index. Whenever the crawl should be halted for a specific domain, you could e.g. modify the value of that field for all the URLs matching that domain.
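A sketch of the pause/resume decision such an intermediate bolt could make. In a real topology this check would sit inside a Storm bolt's `execute(Tuple)` method: if the domain is paused, call `collector.fail(tuple)` so the URL is retried later instead of fetched; otherwise emit it downstream. The in-memory registry below is an illustrative stand-in; in practice the set of paused domains could come from a DB table or a field in the status index.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical gate deciding whether a URL's domain is currently paused.
// Not StormCrawler API: just the decision logic an intermediate bolt
// placed before the fetcher could apply to each incoming tuple.
class DomainGate {

    private final Set<String> paused = ConcurrentHashMap.newKeySet();

    void pause(String domain)  { paused.add(domain); }
    void resume(String domain) { paused.remove(domain); }

    /** True if the URL should be passed on to the fetcher, false if its
     *  domain is paused (the bolt would then fail the tuple). */
    boolean shouldProcess(String url) {
        String host = java.net.URI.create(url).getHost();
        return host == null || !paused.contains(host);
    }

    public static void main(String[] args) {
        DomainGate gate = new DomainGate();
        gate.pause("example.com");
        System.out.println(gate.shouldProcess("http://example.com/page")); // paused domain
        System.out.println(gate.shouldProcess("http://other.org/page"));   // unaffected domain
        gate.resume("example.com");
        System.out.println(gate.shouldProcess("http://example.com/page")); // resumed
    }
}
```

Failing rather than acking the tuple keeps the URL in the status storage for a later retry, which is what makes this behave like "pause" rather than "drop".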

  • Does StormCrawler usually run as one deployed topology?

Yes, often.

  • I assume that for some of these questions the answer may lie in customizing or writing my own bolts or spouts. But I would rather avoid modifying the Fetcher Bolt or the main logic of the crawler, as that would mean I am developing another crawler.

StormCrawler is very modular, so there are always several ways of doing things ;-)

I am pretty sure you could get the behavior you want with a single topology by modifying small, non-core parts. If more essential parts of the code are needed (e.g. per-seed robots settings), then we'd probably want to add that to the code itself - your contributions would be very welcome.
