
Ruby: parse <a> link info from a Nokogiri::XML::NodeSet

I pulled a Nokogiri::XML::NodeSet from a page and here's the result:

<a href="http://www.goldsteinpatentlaw.com" target="_blank" title="Goldstein Patent Law ( U.S.A. )">
    <img src="http://www.asdf.com/LBM_Images/Offices//law-firm-goldstein-patent-law-photo-1258381.jpg" height="62" width="100" alt="Goldstein Patent Law (U.S.A.)">
</a>

I can't figure out how to turn that (obvious to humans) <a> tag into a Mechanize/Nokogiri-parsed object so I can easily retrieve bits of info from the link.

The Nokogiri/Mechanize docs are really confusing because I never know which to look at. Not sure which came first, which uses which, etc. Seems very overcomplicated for the simple scraping & parsing I'm trying to do.

A NodeSet is like an array. If you call puts on a NodeSet, then, just as with an Array, Ruby prints a string representation of each item on its own line. A NodeSet can contain various kinds of nodes, but typically it contains Nokogiri::XML::Element objects, which represent the tags in your HTML.

It is apparent from your output that your NodeSet has only one item, and what you are seeing is the string representation of that item. Here is an example:

require 'nokogiri'

str = "<div>hello</div><div>world</div>"
html_doc = Nokogiri::HTML(str)

divs = html_doc.xpath("//div")

divs.each do |div|
  p div
end

puts '*' * 10
puts divs


    --output:--
#<Nokogiri::XML::Element:0x80836ec4 name="div" children=[#<Nokogiri::XML::Text:0x80836a00 "hello">]>
#<Nokogiri::XML::Element:0x80836668 name="div" children=[#<Nokogiri::XML::Text:0x80836064 "world">]>
**********
<div>hello</div>
<div>world</div>

So you just have to retrieve the first element of your NodeSet, just like you would retrieve the first element in an Array:

p divs[0]

Or, if you know there is only going to be one element in your NodeSet, then you can use:

div = html_doc.at_xpath("//div")

which, instead of returning a NodeSet, returns just the first Element matching the XPath.

When you really want to know what you've got, use p instead of puts: p prints the object's inspect output, while puts prints its to_s string.

Is this what you are looking for?

require 'nokogiri'
str = '<a href="http://www.goldsteinpatentlaw.com" target="_blank" title="Goldstein Patent Law ( U.S.A. )">
          <img src="http://www.asdf.com/LBM_Images/Offices//law-firm-goldstein-patent-law-photo-1258381.jpg" height="62" width="100" alt="Goldstein Patent Law (U.S.A.)">
       </a>'
doc = Nokogiri::HTML(str)
link = doc.at('a')
#=> #<Nokogiri::XML::Element:0x1744488 name="a" attributes=[
     #<Nokogiri::XML::Attr:0x174444c name="href" value="http://www.goldsteinpatentlaw.com">, 
     #<Nokogiri::XML::Attr:0x1744440 name="target" value="_blank">,
     #<Nokogiri::XML::Attr:0x1744434 name="title" value="Goldstein Patent Law ( U.S.A. )">] children=[#<Nokogiri::XML::Text:0x1743d20 "\n    ">, 
     #<Nokogiri::XML::Element:0x1743c9c name="img" attributes=[#<Nokogiri::XML::Attr:0x1743c60 name="src" value="http://www.asdf.com/LBM_Images/Offices//law-firm-goldstein-patent-law-photo-1258381.jpg">, 
     #<Nokogiri::XML::Attr:0x1743c54 name="height" value="62">, #<Nokogiri::XML::Attr:0x1743c48 name="width" value="100">, 
     #<Nokogiri::XML::Attr:0x1743c3c name="alt" value="Goldstein Patent Law (U.S.A.)">]>,
     #<Nokogiri::XML::Text:0x17433d8 "\n">]>

You can use the at, at_css or at_xpath selectors to get just the node you want, and then do things like:

link.attributes["href"].value
#=> "http://www.goldsteinpatentlaw.com"
link.attributes["title"].value
#=> "Goldstein Patent Law ( U.S.A. )"

Maybe a little late here, but for more on NodeSets specifically, look here: http://www.rubydoc.info/gems/nokogiri/Nokogiri/XML/NodeSet#attr-instance_method

Following their docs, this is the code I used to do what you were trying to do, and it works:

result.search("h2 > a").attr("href")
