Nested iterators for em-synchrony in Ruby
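I'm trying to write a parser with eventmachine and em-synchrony (it parses streets and houses by postal code). The problem is that the site I want to parse has a nested structure: for each postal code there are many street pages, and those pages are paginated. So the algorithm is simple: for each postal code, fetch and parse its index page, and if that page has pagination, fetch and parse every remaining page.
Here is a template for such a parser (it works):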
require "nokogiri"
require "em-synchrony"
require "em-synchrony/em-http"

def url(page = nil)
  url = "http://gistflow.com/all"
  url << "?page=#{page}" if page
  url
end

EM.synchrony do
  concurrency = 2
  # here [1] is the array of index pages; for this template let it be just [1]
  results = EM::Synchrony::Iterator.new([1], concurrency).map do |index, iter|
    index_page = EM::HttpRequest.new(url).aget
    index_page.callback do
      # here we do some parsing and find out whether the index page
      # has pagination. The worst case is that it has pagination
      pages = [2, 3, 4, 5]
      unless pages.empty?
        # here we need to parse all pages
        # with urls like url(page)
        # how can I do it more efficiently?
      end
      iter.return "SUCC #{index}"
    end
    index_page.errback do
      iter.return "ERR #{index}"
    end
  end
  p results
  EM.stop
end
So the tricky part is inside this block:
unless pages.empty?
# here we need to parse all pages
# with urls like url(page)
# how can I do it more efficiently?
end
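How can I make nested EM HTTP calls inside the synchrony iterator loop?
I have tried different approaches, but each time I either got an error like "can't yield from root fiber" or the errback block was called.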
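The fiber error can be reproduced without EventMachine. The following is a minimal sketch in plain Ruby, assuming only the core Fiber class; the helper name `blocking_style_call` is hypothetical and merely stands in for a synchronous-style call such as `.get`. It shows that `Fiber.yield` only works inside a fiber that something else has resumed. Callbacks attached with `.callback` are typically run by the reactor outside the fiber that `EM.synchrony` created, which is a likely reason a synchronous call placed there fails.

```ruby
require "fiber"

# Hypothetical stand-in for a synchronous-style call such as `.get`:
# it pauses the current fiber and expects someone to resume it later.
def blocking_style_call
  Fiber.yield :waiting
end

# Inside a fiber the pause works: control returns to the caller of #resume,
# and #resume hands back the value passed to Fiber.yield.
inside = Fiber.new { blocking_style_call }.resume

# Outside any fiber (the "root" fiber) there is nothing to yield to,
# so Ruby raises FiberError.
outside = begin
  blocking_style_call
rescue FiberError => e
  e.class
end

puts "inside a fiber: #{inside.inspect}"
puts "in the root fiber: #{outside}"
```

The exact wording of the FiberError message differs between Ruby versions, so the sketch reports only the exception class.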
One solution is to use FiberIterator with the synchronous .get instead of .aget:
require "em-synchrony"
require "em-synchrony/em-http"
require "em-synchrony/fiber_iterator"

def url(page = nil)
  url = "http://gistflow.com/all"
  url << "?page=#{page}" if page
  url
end

EM.synchrony do
  concurrency = 2
  master_pages = [1, 2, 3, 4]

  EM::Synchrony::FiberIterator.new(master_pages, concurrency).each do |iter|
    # .get pauses the current fiber and returns the client object even when
    # the request fails, so check the status rather than the return value
    result = EM::HttpRequest.new(url).get
    if result.response_header.status == 200
      puts "SUCC #{iter}"
      detail_pages = [1, 2, 3, 4]
      # each detail page is fetched with a url like url(page)
      EM::Synchrony::FiberIterator.new(detail_pages, concurrency).each do |iter2|
        result2 = EM::HttpRequest.new(url(iter2)).get
        # result2 can be checked the same way as result above
        puts "SUCC/ERR #{iter} > #{iter2}"
      end
    else
      puts "ERR #{iter}"
    end
  end

  EM.stop
end
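To illustrate why the synchronous-style calls can be nested, here is a rough sketch of the underlying idea in plain Ruby. It is not em-synchrony's actual source: `PENDING`, `async_fetch` and `sync_fetch` are hypothetical names, and the "reactor" is just an array of callbacks. The point is that a callback-style call can be wrapped so that it parks the current fiber and is resumed with the result, which lets each outer item walk its inner pages with ordinary sequential code while other fibers make progress.

```ruby
require "fiber"

# Hypothetical miniature reactor: a queue of callbacks run in order.
PENDING = []

# Callback-style fetch (plays the role of `.aget`): the block is called
# later by the reactor loop, not immediately.
def async_fetch(page, &on_done)
  PENDING << -> { on_done.call("body of page #{page}") }
end

# Synchronous-looking fetch (plays the role of `.get`): park the current
# fiber and let the callback resume it with the result.
def sync_fetch(page)
  fiber = Fiber.current
  async_fetch(page) { |body| fiber.resume(body) }
  Fiber.yield
end

results = []

# One fiber per index page; each walks its own detail pages sequentially.
[1, 2].each do |index|
  Fiber.new do
    [1, 2].each do |detail|
      results << "#{index} > #{detail}: #{sync_fetch(detail)}"
    end
  end.resume
end

# Run the reactor until no callbacks remain.
PENDING.shift.call until PENDING.empty?

puts results
```

In this sketch the two outer fibers interleave: each one pauses at every fetch, so the inner loops advance in turns instead of one outer item finishing before the next starts.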