Using OpenUri, how can I get the contents of a redirecting page?

I want to get data from this page:

http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793

But that page forwards to:

http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=eXs1

So, when I use open from OpenUri to try to fetch the data, it raises a RuntimeError with the message "HTTP redirection loop".

I'm not really sure how to get that data once it redirects and raises that error.

You need a tool like Mechanize. From its description:

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

which is exactly what you need. So,

sudo gem install mechanize

then

require 'mechanize'
agent = Mechanize.new  # releases before 1.0 used WWW::Mechanize
page = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793"

page.content # Get the resulting page as a string
page.body # Get the body content of the resulting page as a string
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri

and you're ready to rock 'n' roll.

While Mechanize is a wonderful tool, I prefer to "cook" my own thing.

If you are serious about parsing, you can take a look at this code. It crawls thousands of sites internationally every day, and as far as I have researched and tweaked, there isn't a more stable approach that also lets you customize it heavily later to suit your needs.

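require "open-uri"
require "timeout"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"
# CharGuess and Iconv, used below, come from separate gems and must be loaded as well.

# Fetches url_address and stores the response metadata and parsed document on self.
# SHINSO_HEADERS (request headers) and make_absolute_address are not part of
# this snippet and must be defined elsewhere.
def crawl(url_address)
self.errors = Array.new
begin
  begin
    url_address = URI.parse(url_address)
  rescue URI::InvalidURIError
    # Re-escape the address and retry (URI.decode/URI.encode were removed in Ruby 3.0).
    parser = URI::RFC2396_Parser.new
    url_address = URI.parse(parser.escape(parser.unescape(url_address)))
  end
  url_address.normalize!
  stream = ""
  # open-uri follows redirects itself; base_uri below is the final address.
  Timeout.timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
  if stream.size > 0
    url_crawled = URI.parse(stream.base_uri.to_s)
  else
    self.errors << "Server said status 200 OK but document file is zero bytes."
    return
  end
rescue StandardError => exception
  # Record network, timeout, and parse failures instead of raising.
  self.errors << exception
  return
end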
# extract information before html parsing
self.url_posted       = url_address.to_s
self.url_parsed       = url_crawled.to_s
self.url_host         = url_crawled.host
self.status           = stream.status
self.content_type     = stream.content_type
self.content_encoding = stream.content_encoding
self.charset          = stream.charset
if    stream.content_encoding.include?('gzip')
  document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
  # A deflate-encoded body has to be inflated (decompressed).
  document = Zlib::Inflate.inflate(stream.read)
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
  document = stream.read
end
self.charset_guess = CharGuess.guess(document)
# Convert to UTF-8 only when a charset was detected and it is not already UTF-8.
unless self.charset_guess.blank? or ['utf-8', 'utf8'].include?(self.charset_guess.downcase)
  document = Iconv.iconv("UTF-8", self.charset_guess, document).join
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
  item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end
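
The crawl method above calls make_absolute_address, which this answer does not show. A minimal sketch of what such a helper might look like is below. It is a hypothetical stand-in, not the author's code: the second parameter (base) is an assumption, since the original call passes only the src value and presumably reads the page address from the object itself.

```ruby
require "uri"

# Hypothetical stand-in for make_absolute_address (not shown in the original).
# address: the value of a src attribute, possibly relative.
# base:    the address the page was fetched from (assumed parameter).
def make_absolute_address(address, base)
  # URI.join resolves relative references against the base address
  # and returns absolute addresses unchanged.
  URI.join(base, address).to_s
rescue URI::InvalidURIError
  # Leave addresses that cannot be parsed untouched.
  address
end
```

Inside crawl, the base would most likely be the final address after redirects (url_crawled), so that relative src values resolve against the page that was actually served.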

The site seems to be doing some of the redirection logic with sessions. If you don't send back the session cookies it sets on the first request, you will end up in a redirect loop. IMHO it's a crappy implementation on their part.

However, I tried to pass the cookies back to them and didn't get it to work, so I can't be completely sure that's all that's going on here.
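
For reference, passing the cookies back by hand could look roughly like the sketch below, using only Net::HTTP from the standard library. The helper names (merge_cookies, cookie_header, fetch_with_cookies) are made up for illustration, cookie attributes such as Path and Expires are ignored, and, as noted above, it is not confirmed that cookies alone are enough for this particular site.

```ruby
require "net/http"
require "uri"

# Merge the Set-Cookie header values of a response into a jar (a plain Hash).
# Only the leading "name=value" pair of each header is kept.
def merge_cookies(jar, set_cookie_values)
  Array(set_cookie_values).each do |value|
    name, cookie_value = value.split(";", 2).first.split("=", 2)
    jar[name.strip] = cookie_value.to_s.strip
  end
  jar
end

# Build the value of the Cookie request header from the jar.
def cookie_header(jar)
  jar.map { |name, value| "#{name}=#{value}" }.join("; ")
end

# Follow redirects manually, sending back every cookie received so far.
def fetch_with_cookies(url, limit = 10)
  uri = URI.parse(url)
  jar = {}
  limit.times do
    request = Net::HTTP::Get.new(uri)
    request["Cookie"] = cookie_header(jar) unless jar.empty?
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    merge_cookies(jar, response.get_fields("Set-Cookie"))
    location = response["Location"]
    return response.body unless response.is_a?(Net::HTTPRedirection) && location
    uri = uri.merge(location) # Location may be relative to the current address
  end
  raise "Too many redirects"
end
```

Calling fetch_with_cookies with the tracking URL from the question would then return the body of the first non-redirect response, or raise after ten hops if the site keeps redirecting even with the cookies attached.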
