
StormCrawler: best topology for cluster

I am using StormCrawler to crawl 40k sites, with max_depth = 2, and I want to do it as fast as possible. I have 5 Storm nodes (with different static IPs) and 3 Elasticsearch nodes. For now, my best topology is:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.CollapsingSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 5
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 5
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 100
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 25
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 25
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 5

and the crawler config:

config: 
  topology.workers: 5
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 250
  topology.debug: false
  fetcher.threads.number: 500
  worker.heap.memory.mb: 4096

Questions: 1) Should I use AggregationSpout or CollapsingSpout, and what is the difference? I tried AggregationSpout, but the performance was equal to the performance of one machine with the default configuration.

2) Is this parallelism configuration correct?

3) When I jumped from the 1-node to the 5-node configuration, I noticed that "FETCH ERROR" increased by about 20%, and a lot of sites were not fetched correctly. What could be the reason?

UPDATE:

ES-conf.yaml:

# configuration for Elasticsearch resources

config:
  # ES indexer bolt
  # addresses can be specified as a full URL
  # if not we assume that the protocol is http and the port 9200
  es.indexer.addresses: "1.1.1.1"
  es.indexer.index.name: "index"
  es.indexer.doc.type: "doc"
  es.indexer.create: false
  es.indexer.settings:
    cluster.name: "webcrawler-cluster"

  # ES metricsConsumer
  es.metrics.addresses: "http://1.1.1.1:9200"
  es.metrics.index.name: "metrics"
  es.metrics.doc.type: "datapoint"
  es.metrics.settings:
    cluster.name: "webcrawler-cluster"

  # ES spout and persistence bolt
  es.status.addresses: "http://1.1.1.1:9200"
  es.status.index.name: "status"
  es.status.doc.type: "status"
  #es.status.user: "USERNAME"
  #es.status.password: "PASSWORD"
  # the routing is done on the value of 'partition.url.mode'
  es.status.routing: true
  # stores the value used for the routing as a separate field
  # needed by the spout implementations
  es.status.routing.fieldname: "metadata.hostname"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  es.status.settings:
    cluster.name: "webcrawler-cluster"

  ################
  # spout config #
  ################

  # positive or negative filter parsable by the Lucene Query Parser
  # es.status.filterQuery: "-(metadata.hostname:stormcrawler.net)"

  # time in secs for which the URLs will be considered for fetching after an ack or a fail
  es.status.ttl.purgatory: 30

  # Min time (in msecs) to allow between 2 successive queries to ES
  es.status.min.delay.queries: 2000

  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "metadata.hostname"
  # field to sort the URLs within a bucket
  es.status.bucket.sort.field: "nextFetchDate"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset
  es.status.reset.fetchdate.after: -1

  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query 
  es.status.max.start.offset: 500

  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false

  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
         parallelism.hint: 1
         #whitelist:
         #  - "fetcher_counter"
         #  - "fetcher_average.bytes_fetched"
         #blacklist:
         #  - "__receive.*"

1) Should I use AggregationSpout or CollapsingSpout, and what is the difference? I tried AggregationSpout, but the performance was equal to the performance of one machine with the default configuration.

As the names suggest, AggregationSpout uses aggregations as a mechanism for grouping URLs by host (or domain, or IP, or whatever), whereas CollapsingSpout uses collapsing. The latter is likely to be slower if you configure it to have more than one URL per bucket (es.status.max.urls.per.bucket), because it issues sub-queries for each bucket. AggregationSpout should have good performance, especially with es.status.sample set to true. CollapsingSpout is experimental at this stage.
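Following that advice, switching to AggregationSpout and enabling sampling only takes two small config changes; this is a sketch based on the topology and es-conf.yaml shown above:

```yaml
# In the topology definition: swap the spout implementation
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

# In es-conf.yaml: enable sampling for large crawls
config:
  es.status.sample: true
```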

2) Is this parallelism configuration correct?

That is probably more JSoupParserBolts than needed. In practice, a ratio of 1:4 compared to the FetcherBolts is fine, even with 500 fetching threads. The Storm UI is useful for spotting bottlenecks and the components which need scaling. Everything else looks fine, but realistically you should look at the Storm UI and the metrics to tune the topology to the best settings for your crawl.
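One reading of that 1:4 guideline, applied to the 5 FetcherBolts above, would bring the parser parallelism down from 100 to around 20; the figure is illustrative, and the Storm UI capacity metric should drive the final value:

```yaml
bolts:
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 5
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 20   # ~4x the FetcherBolt parallelism, instead of 100
```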

3) When I jumped from the 1-node to the 5-node configuration, I noticed that "FETCH ERROR" increased by about 20%, and a lot of sites were not fetched correctly. What could be the reason?

This could suggest that you are saturating your network connection, but that should not be the case when using more nodes; on the contrary. Maybe check with the Storm UI how the FetcherBolts are distributed over the nodes. Is one worker running all the instances, or do they all get an equal number? Look at the logs to see what happens, e.g. are there plenty of timeout exceptions?
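If the logs do show many timeout exceptions, the fetch-related settings in the crawler config are one place to experiment. The values below are assumptions to illustrate which knobs exist, not recommendations; the comments show the StormCrawler defaults:

```yaml
config:
  # max time in ms to fetch a page; default is 10000
  http.timeout: 20000
  # politeness delay in secs between successive requests to the same host
  fetcher.server.delay: 1.0
  # number of fetch threads per host queue; keep at 1 unless politeness allows more
  fetcher.threads.per.queue: 1
```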
