Random Timeout::Error Exception in Ruby with Mechanize Gem

I'm building an application in Ruby 1.9.3-p327 that fetches and parses some pages (scraping) and then, depending on some values, inserts/updates some columns in the database. For the fetching and parsing, the app uses the Mechanize gem, and access to the database (MySQL) goes through the activerecord gem.

The weird problem I have is that a Timeout::Error exception is sometimes raised at random: it may not happen for a couple of days, and then it happens again with different types of records or pages. The log of the exception is:

/root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:146:in `rescue in rbuf_fill': too many connection resets (due to Timeout::Error - Timeout::Error) after 0 requests on 21716860, last used 1378984537.2796552 seconds ago (Net::HTTP::Persistent::Error)
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:140:in `rbuf_fill'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1316:in `catch'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1316:in `transport_request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1293:in `request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.9/lib/net/http/persistent.rb:986:in `request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:257:in `fetch'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize.rb:432:in `get'
    from /root/notificador-corte/lib/downloader.rb:10:in `fetch'
    from /root/notificador-corte/worker.rb:63:in `fetch_page'
    from /root/notificador-corte/worker.rb:49:in `process_causa'
    from /root/notificador-corte/worker.rb:41:in `block in worker_main_cycle'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/activerecord-4.0.0/lib/active_record/relation/delegation.rb:13:in `each'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/activerecord-4.0.0/lib/active_record/relation/delegation.rb:13:in `each'
    from /root/notificador-corte/worker.rb:39:in `worker_main_cycle'
    from /root/notificador-corte/worker.rb:26:in `run'
    from /root/notificador-corte/app.rb:12:in `<main>'

Line 10 of downloader.rb is inside the definition of the fetch method:

def fetch(url)
    begin
      @agent.get(url)
    rescue Errno::ETIMEDOUT, Timeout::Error => exception
    end
  end

Line 63 of worker.rb contains the call to the fetch method.

Reading the documentation, it says I should try setting the read_timeout and open_timeout properties on the Mechanize agent, and also try idle_timeout and keep_alive, but the error still occurs at random.
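For reference, this is a minimal sketch of how those settings can be applied to a Mechanize agent; the timeout values below are arbitrary examples, not recommendations:

require 'mechanize'

agent = Mechanize.new
agent.open_timeout = 10    # seconds to wait for the connection to open
agent.read_timeout = 10    # seconds to wait for each socket read
agent.idle_timeout = 5     # drop persistent connections idle longer than this
agent.keep_alive   = false # disable persistent connections altogether

With keep_alive disabled, Mechanize uses a fresh connection per request, which can avoid reusing a socket the server has silently closed, at the cost of extra connection overhead.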

The content of the Gemfile is:

gem 'activerecord', "~> 4.0.0" 
gem 'mechanize', "~> 2.7.1"
gem 'mysql', '~> 2.9.1'
gem 'actionmailer', "~> 4.0.0" 
gem 'rspec', "~> 2.14.1"

I don't think it's necessarily a bug in either your code or Mechanize itself. Most likely it's a network issue.

I would rather implement a retry policy in that rescue clause, so that whenever this error occurs, you make sure to retry at a later point.
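A minimal sketch of such a policy, assuming an arbitrary retry limit and back-off delay (both numbers are illustrative, not tuned):

def fetch(url, max_retries = 3)
  attempts = 0
  begin
    @agent.get(url)
  rescue Errno::ETIMEDOUT, Timeout::Error, Net::HTTP::Persistent::Error => exception
    attempts += 1
    raise exception if attempts > max_retries  # give up after the last attempt
    sleep(2 ** attempts)                       # back off before retrying
    retry
  end
end

Note that the stack trace above shows the error actually surfacing as Net::HTTP::Persistent::Error, so rescuing Timeout::Error alone may not be enough to catch it.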

When you rescue you have

 Errno::ETIMEDOUT

Is this misspelled? Or is it something I'm unfamiliar with?

It's very possible that your issue is some bad websites or links. I've had all sorts of issues when scraping the internet at large. I've found it's best to catch all errors, print the error message, and continue to the next possible operation. That way your scraper won't stop on the bad cases, and you can go back and resolve issues as they come up.
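As an illustration, a sketch of that catch-all pattern; urls and process here are hypothetical placeholders for your own URL list and per-page handler:

urls.each do |url|
  begin
    page = @agent.get(url)
    process(page)  # hypothetical per-page handler
  rescue StandardError => e
    warn "failed on #{url}: #{e.class}: #{e.message}"  # log and move on
    next
  end
end

Rescuing StandardError (rather than bare Exception) keeps signals like Interrupt working while still swallowing ordinary network and parse errors.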
