I have to read XML files that are accessible over HTTP with authentication, which is why I use Mechanize.
My problem is that I can't get Mechanize to recognize these XML files so that I can use .find or .search on them.
Here is what I tried first, in my view (an HTML file):
<% agent = Mechanize.new %>
<% page = agent.get("http://dl.dropbox.com/u/344349/xml.xml") %>
<%= page %>
This returns #<Mechanize::File:0x007f9dd602de30>. It's a ::File and not a ::Page.
I can't use .find or .search on this, because it fails with: undefined method find for #<Mechanize::File:0x007f9dd624cbd0>
The Mechanize documentation says: "This is the default (and base) class for the Pluggable Parsers. If Mechanize cannot find an appropriate class to use for the content type, this class will be used. For example, if you download a JPG, Mechanize will not know how to parse it, so this class will be instantiated."
So I created a class as described here: http://rdoc.info/github/tenderlove/mechanize/master/Mechanize/PluggableParser
My class:
class XMLParser < Mechanize::File
attr_reader :xml
def initialize(uri=nil, response=nil, body=nil, code=nil)
super(uri, response, body, code)
@xml = xml.parse(body)
end
end
And the updated code in my view (an HTML file):
<% agent = Mechanize.new %>
<% agent.pluggable_parser['text/xml'] = XMLParser %>
<% agent.user_agent_alias = 'Windows Mozilla' %>
<% page = agent.get("http://dl.dropbox.com/u/344349/xml.xml") %>
<%= page %>
or even
<% agent = Mechanize.new %>
<% agent.pluggable_parser.xml = XMLParser %>
<% page1 = agent.get('http://dl.dropbox.com/u/344349/xml.xml') # => CSVParser %>
<%= page1 %>
This still returns #<Mechanize::File:0x007f9dd5253b48>
I even tested the exact code from the documentation (CSVParser, http://rdoc.info/github/tenderlove/mechanize/master/Mechanize/PluggableParser ) and tried loading a CSV file, which is still seen as a ::File.
What am I doing wrong?
Okay, so I've resolved this problem for myself just now. The solution is in two parts:
First, the content type you are matching is incorrect. If you run this line after your get, it will tell you the content type of the document you fetched:
page.response['content-type'] # => 'application/xml', not 'text/xml'
When I use mechanize to get your page ('http://dl.dropbox.com/u/344349/xml.xml'), I see 'application/xml' as the content type.
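Since servers often append parameters such as a charset to the Content-Type header, it can help to reduce the header to its bare media type before deciding which key to register. Here is a minimal sketch in plain Ruby. The helpers media_type and xml_content_type? are hypothetical names of my own, not part of Mechanize, and Mechanize's internal lookup may work differently:

```ruby
# Hypothetical helper (not part of Mechanize): reduce a raw Content-Type
# header value to its bare, lowercased media type.
def media_type(header)
  return nil if header.nil?

  # Drop parameters such as "; charset=utf-8" and surrounding whitespace.
  header.split(';').first.to_s.strip.downcase
end

# Hypothetical helper: true when the header names an XML media type.
# Treating any "+xml" suffix as XML is an assumption made for this sketch.
def xml_content_type?(header)
  type = media_type(header)
  return false if type.nil? || type.empty?

  %w[application/xml text/xml].include?(type) || type.end_with?('+xml')
end

# Example with a hard-coded header value instead of a live response.
puts media_type('application/xml; charset=utf-8')
puts xml_content_type?('text/csv')
```

With a real response you would pass page.response['content-type'] to these helpers and then register your parser under whatever media type comes back.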
Second, you're not using PluggableParser correctly. Using XMLParser as you have it here will raise NoMethodError: undefined method 'parse' for nil:NilClass, because xml is just the reader for @xml, which is still nil when xml.parse(body) runs. Change the class definition to use Nokogiri::XML instead:
class XmlParser < Mechanize::File
attr_reader :xml
def initialize(uri = nil, response = nil, body = nil, code = nil)
@xml = Nokogiri::XML(body)
super uri, response, body, code
end
end
Then, set this as the parser for the correct content type:
agent.pluggable_parser['application/xml'] = XmlParser
To use this, get your page the same way as before, then read the xml attribute of the page object. It is a Nokogiri::XML::Document instance, which is a subclass of Nokogiri::XML::Node. Fortunately, Mechanize::Page#search is just a wrapper around Nokogiri::XML::Node#search, so you can search pretty much the way you expect. Like this:
page.xml.search 'catalog'
A further refinement would be to delegate XmlParser#search (and friends) to the corresponding Nokogiri methods:
# This is the same as what Mechanize::Page does
require 'forwardable'

class XmlParser < Mechanize::File
extend Forwardable
def_delegators :@xml, :search, :/, :at
end
This lets you perform your searches directly on the page instance:
page.search 'catalog'
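If the delegation step looks like magic, here is a small self-contained sketch of what def_delegators does, using only the standard library. FakeDocument and FakeParser are hypothetical stand-ins I made up for illustration; in the real code their roles are played by Nokogiri::XML::Document and the XmlParser class above:

```ruby
require 'forwardable'

# Hypothetical stand-in for a parsed XML document.
class FakeDocument
  def initialize(names)
    @names = names
  end

  # Return every stored element name equal to the query.
  def search(query)
    @names.select { |name| name == query }
  end

  # Return the first match, or nil when nothing matches.
  def at(query)
    search(query).first
  end
end

# Hypothetical stand-in for the parser class.
class FakeParser
  extend Forwardable

  # Calls to #search and #at on a FakeParser are forwarded to @xml.
  def_delegators :@xml, :search, :at

  def initialize(names)
    @xml = FakeDocument.new(names)
  end
end

page = FakeParser.new(%w[catalog book book])
puts page.search('book').length
puts page.at('catalog').inspect
```

The parser object itself defines no search logic. It just forwards the calls to whatever is stored in @xml, which is why page.search works once @xml holds the Nokogiri document.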
Alternatively, you can change the default parser to the Page class, like this:
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Page
agent.get("http://dl.dropbox.com/u/344349/xml.xml").class # Mechanize::Page
See http://mechanize.rubyforge.org/Mechanize/PluggableParser.html