Mechanize won't connect to site

Hello, I have a problem: the mechanize gem won't connect to a site. The gem is installed. Code:

require 'mechanize'

agent = Mechanize.new
main_page = agent.get 'https://imbd.com'
list_page = main_page.link_with(text: "Top 250").click
rows = list_page.root.css(".lister-list tr")

puts rows.size

And this is the error:

C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `initialize': A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) for "imbd.com" port 80 (Errno::ETIMEDOUT)
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `open'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `block in connect'
    from C:/Ruby/lib/ruby/2.2.0/timeout.rb:73:in `timeout'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:878:in `connect'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:863:in `do_start'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:858:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:631:in `connection_for'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize/http/agent.rb:267:in `fetch'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize.rb:464:in `get'
    from C:/Ruby/Workspace/imbd.rb:4:in `<main>'

Does anyone have any idea what's wrong? Thanks!

After looking at IMDb, I see they run a lot of JavaScript, which will trip Mechanize up since it can't execute JS and so can't make sense of the resulting page. I would suggest using Capybara instead of Mechanize if you want to scrape the content or automate browsing. Combining Capybara with a driver like Poltergeist (you'll need to install PhantomJS for this approach) will work far better than Mechanize, since it is built for automating interaction with pages that load lots of JS.

I've added a possible workaround for the error. If it works, it's because Mechanize was trying to fetch the page before the JS had finished running and therefore wasn't getting valid data.

Edit:

  require 'mechanize'

  agent = Mechanize.new
  agent.read_timeout = 3  # read timeout for the agent, in seconds
  begin
    main_page = agent.get 'https://imbd.com'
    # keep the page returned by the click so it can be queried below
    list_page = main_page.link_with(text: "Top 250").click
    rows = list_page.root.css(".lister-list tr")
  rescue Timeout::Error
    puts "Timeout!"
    puts "read_timeout attribute is set to #{agent.read_timeout}s" unless agent.read_timeout.nil?
  end
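
One thing worth checking with this approach: the trace in the question ends in Errno::ETIMEDOUT, raised while the connection was being opened. That class is a SystemCallError, not a Timeout::Error, so a plain `rescue Timeout::Error` would not catch it. Below is a minimal sketch that separates the two cases. `fetch_phase` is a made-up helper name, and the failures are simulated with `raise` so the snippet runs without Mechanize or network access.

```ruby
require 'net/http' # defines Net::OpenTimeout and Net::ReadTimeout
require 'timeout'

# Hypothetical helper: runs the block and reports which phase failed.
def fetch_phase
  yield
  :ok
rescue Errno::ETIMEDOUT, Errno::ECONNREFUSED, SocketError, Net::OpenTimeout
  :connect_failed  # the connection was never established
rescue Timeout::Error
  :read_timed_out  # e.g. Net::ReadTimeout, which is a Timeout::Error
end

# Simulated failures, no network involved.
puts fetch_phase { raise Errno::ETIMEDOUT }  # prints connect_failed
puts fetch_phase { raise Net::ReadTimeout }  # prints read_timed_out

# Errno::ETIMEDOUT is not part of the Timeout::Error hierarchy.
puts Errno::ETIMEDOUT.ancestors.include?(Timeout::Error)  # prints false
```

In real code the block would wrap the `agent.get` call, and a `:connect_failed` result points at the host or network rather than at a slow page.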

While it's true that Mechanize doesn't support JavaScript, your problem is that you are trying to access a site that doesn't exist: your code requests imbd.com instead of imdb.com. So the error message is accurate.
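
A typo like this is easy to catch before any request goes out. As a minimal sketch (the helper name `expected_host?` is made up for illustration and is not part of Mechanize), you could compare the URL's host against the domain you actually mean to hit:

```ruby
require 'uri'

# Hypothetical helper: true when the URL's host is the expected domain,
# ignoring case and a leading "www.".
def expected_host?(url, expected)
  host = URI.parse(url).host.to_s.downcase.sub(/\Awww\./, '')
  host == expected
end

puts expected_host?('https://imbd.com', 'imdb.com')      # prints false
puts expected_host?('https://www.imdb.com', 'imdb.com')  # prints true
```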

And FWIW, IMDb doesn't want you to scrape their site:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
