StormCrawler: best topology for cluster
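I am using StormCrawler to crawl 40k sites with max_depth = 2, and I want to finish as fast as possible. I have 5 Storm nodes (with different static IPs) and 3 Elasticsearch nodes. So far, my best topology is: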
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.CollapsingSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 5
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 5
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 100
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 25
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 25
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 5
and the crawler config:
config:
  topology.workers: 5
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 250
  topology.debug: false
  fetcher.threads.number: 500
  worker.heap.memory.mb: 4096
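Questions:
1) Should I use AggregationSpout or CollapsingSpout, and what is the difference? I tried AggregationSpout, but the performance was equal to that of 1 machine with the default configuration.
2) Is this parallelism configuration correct?
3) When I went from a 1-node to a 5-node configuration, I found that "FETCH ERROR" increased by about 20% and a lot of sites were not fetched correctly. What could be the reason?
Update: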
ES-conf.yaml:
# configuration for Elasticsearch resources

config:
  # ES indexer bolt
  # addresses can be specified as a full URL
  # if not we assume that the protocol is http and the port 9200
  es.indexer.addresses: "1.1.1.1"
  es.indexer.index.name: "index"
  es.indexer.doc.type: "doc"
  es.indexer.create: false
  es.indexer.settings:
    cluster.name: "webcrawler-cluster"

  # ES metricsConsumer
  es.metrics.addresses: "http://1.1.1.1:9200"
  es.metrics.index.name: "metrics"
  es.metrics.doc.type: "datapoint"
  es.metrics.settings:
    cluster.name: "webcrawler-cluster"

  # ES spout and persistence bolt
  es.status.addresses: "http://1.1.1.1:9200"
  es.status.index.name: "status"
  es.status.doc.type: "status"
  #es.status.user: "USERNAME"
  #es.status.password: "PASSWORD"
  # the routing is done on the value of 'partition.url.mode'
  es.status.routing: true
  # stores the value used for the routing as a separate field
  # needed by the spout implementations
  es.status.routing.fieldname: "metadata.hostname"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  es.status.settings:
    cluster.name: "webcrawler-cluster"

  ################
  # spout config #
  ################

  # positive or negative filter parsable by the Lucene Query Parser
  # es.status.filterQuery: "-(metadata.hostname:stormcrawler.net)"

  # time in secs for which the URLs will be considered for fetching after an ack or fail
  es.status.ttl.purgatory: 30

  # Min time (in msecs) to allow between 2 successive queries to ES
  es.status.min.delay.queries: 2000

  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "metadata.hostname"
  # field to sort the URLs within a bucket
  es.status.bucket.sort.field: "nextFetchDate"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset
  es.status.reset.fetchdate.after: -1

  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
  es.status.max.start.offset: 500

  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false

  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
    - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
      parallelism.hint: 1
      #whitelist:
      #  - "fetcher_counter"
      #  - "fetcher_average.bytes_fetched"
      #blacklist:
      #  - "__receive.*"
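1) Should I use AggregationSpout or CollapsingSpout, and what is the difference? I tried AggregationSpout, but the performance was equal to that of 1 machine with the default configuration.
As the names suggest, AggregationSpout uses aggregations as the mechanism for grouping URLs by host (or domain, or IP, or whatever), whereas CollapsingSpout uses collapsing. The latter is likely to be slower if you configure it to have more than one URL per bucket (es.status.max.urls.per.bucket), because it issues a sub-query for each bucket. AggregationSpout should perform well, especially if es.status.sample is set to true. The CollapsingSpouts are experimental at this stage.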
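As a minimal sketch of that suggestion, the two fragments below swap the spout class in the topology file and enable sampling in ES-conf.yaml. The AggregationSpout class name is assumed to live in the same persistence package as the CollapsingSpout used in the question, and the parallelism of 10 is simply carried over from the question rather than a tuned value.

```yaml
# Topology file: replace CollapsingSpout with AggregationSpout
# (class name assumed to sit in the same package as CollapsingSpout)
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10   # carried over from the question, not a recommendation
---
# ES-conf.yaml: enable sampling for the AggregationSpout
config:
  es.status.sample: true
```

With AggregationSpout, es.status.max.start.offset no longer applies, since the config comments describe it as a CollapsingSpout setting.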
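2) Is this parallelism configuration correct?
That is probably more JSoupParserBolts than needed. In practice, a ratio of 1:4 compared to FetcherBolts is fine, even with 500 fetching threads. Everything else looks OK, but in practice you should use the Storm UI and the metrics to spot bottlenecks, see which components need scaling, and tune the topology to the best settings for your crawl.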
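One way to read that ratio, shown here only as a sketch, is four parser instances per FetcherBolt instance. With the 5 FetcherBolts from the question that gives 20 JSoupParserBolts instead of 100. The figure of 20 is an interpretation of the answer, not a measured optimum, and should be checked against the Storm UI.

```yaml
bolts:
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 5
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    # assumed reading of the 1:4 ratio: 4 parsers per FetcherBolt (was 100)
    parallelism: 20
```

If the Storm UI shows the parse bolt's capacity staying close to 1, raise this value again. If it stays well below that, it can likely be lowered further.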
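3) When I went from a 1-node to a 5-node configuration, I found that "FETCH ERROR" increased by about 20% and a lot of sites were not fetched correctly. What could be the reason?
This could indicate that you are saturating your network connection, but that should not be the case when using more nodes; quite the contrary. Check in the Storm UI how the FetcherBolts are distributed across the nodes: is one worker running all the instances, or does each worker get an equal number? Also look at the logs to see what is happening, for example whether there are lots of timeout exceptions.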