
Parsing HTML between tags with Nokogiri

Here's what my HTML file looks like:

<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
<a href='http://crossfitpentagon.com/' target='_blank'>CrossFit Pentagon</a> - Washington,&nbsp;DC<br />
<a href='http://crossfitwtbn.blogspot.com/' target='_blank'>CrossFit WTBN</a> - Quantico,&nbsp;VA<br />
<a href='http://cfnewriver.blogspot.com/' target='_blank'>CrossFit New River</a> - Jacksonville,&nbsp;NC<br />
<a href='http://xfitmiramar.com' target='_blank'>CrossFit Miramar</a> - San Diego,&nbsp;CA<br />
<a href='http://www.crossfitfortmeade.com/' target='_blank'>CrossFit Fort Meade</a> - Odenton,&nbsp;MD<br />

I was able to extract the link text and URL, but I also need the text that sits between the closing </a> and the next <a>, that is, whatever comes right before the <br />. For example, from the first line I need to extract "Quantico,&nbsp;VA".

Here's the part of my code that extracts what I have so far. Once I have the page object, I'll loop over each line of the HTML source to extract all of the data I need:

page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm")) 
if page.text != ""
    ## Get the URL and Name
    if page.css("a")[i] != nil
        name = page.css("a")[i].text
    else
        name = 'NA'
    end
    if page.css("a")[i] != nil
        url = page.css("a")[i]["href"]
    else
        url = 'NA'
    end
end

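Read through the Nokogiri::XML::Node and Nokogiri::XML::NodeSet documentation. The methods listed there let you navigate the document and extract nodes. For example, next_sibling returns the node immediately following a link, which in this markup is the text you're after: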

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
</body>
</html>
EOT

data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href,
    text: link_text,
    trailing_text: trailing_text
  }
}

data will contain:

data 
# => [{:href=>"http://crossfitquantico.blogspot.com/",
#      :text=>"CrossFit Quantico",
#      :trailing_text=>" - Quantico,\u00A0VA"},
#     {:href=>"http://www.crossfitcherrypoint.com",
#      :text=>"CrossFit Cherry Point",
#      :trailing_text=>" - Havelock,\u00A0NC"}]
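The trailing text still carries the leading " - " separator and a non-breaking space (\u00A0, the decoded &nbsp;). If you want just the location, a small clean-up step might look like the sketch below. Note that clean_trailing_text is a hypothetical helper name, not part of Nokogiri or of the code above, and it assumes the separator is always a single hyphen.

```ruby
# Hypothetical helper (an assumption, not part of Nokogiri):
# tidy the text node that follows each link.
def clean_trailing_text(raw)
  raw.to_s
     .tr("\u00A0", ' ')      # non-breaking space -> ordinary space
     .sub(/\A\s*-\s*/, '')   # drop the leading " - " separator
     .strip                  # trim leftover whitespace
end

puts clean_trailing_text(" - Quantico,\u00A0VA")  # prints "Quantico, VA"
puts clean_trailing_text(" - Havelock,\u00A0NC")  # prints "Havelock, NC"
```

Converting the non-breaking space first matters because Ruby's \s and String#strip do not treat \u00A0 as whitespace.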

Don't do this:

page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm")) 
if page.text != ""
    ## Get the URL and Name
    if page.css("a")[i] != nil
        name = page.css("a")[i].text
    else
        name = 'NA'
    end
    if page.css("a")[i] != nil
        url = page.css("a")[i]["href"]
    else
        url = 'NA'
    end
end

if page.text != "" doesn't really tell you what you want to know, which is whether there are links. Simply searching the document will tell you that.

Every call to page.css("a") searches the DOM for links again, which wastes CPU. Testing page.css("a")[i] != nil is a waste too. If you iterate over the links found in a syntactically correct document, you'll never hit a link that isn't there, because search and its act-alikes, such as css, only hand you nodes that exist.

Here's a minor tweak to the map-based example above that supplies "NA" for missing values:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
    <a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
    <a ></a>  
  </body>
</html>
EOT

doc.search('a').class # => Nokogiri::XML::NodeSet
doc.search('a').size # => 3

data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href || 'NA',
    text: link_text.empty? ? 'NA' : link_text,
    trailing_text: trailing_text
  }
}

data.size # => 3
data.class # => Array
data.first.class # => Hash

data 
# => [{:href=>"http://crossfitquantico.blogspot.com/",
#      :text=>"CrossFit Quantico",
#      :trailing_text=>" - Quantico,\u00A0VA"},
#     {:href=>"http://www.crossfitcherrypoint.com",
#      :text=>"CrossFit Cherry Point",
#      :trailing_text=>" - Havelock,\u00A0NC"},
#     {:href=>"NA", :text=>"NA", :trailing_text=>"  \n  "}]
