Here's what my HTML file looks like:
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico, VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock, NC<br />
<a href='http://crossfitpentagon.com/' target='_blank'>CrossFit Pentagon</a> - Washington, DC<br />
<a href='http://crossfitwtbn.blogspot.com/' target='_blank'>CrossFit WTBN</a> - Quantico, VA<br />
<a href='http://cfnewriver.blogspot.com/' target='_blank'>CrossFit New River</a> - Jacksonville, NC<br />
<a href='http://xfitmiramar.com' target='_blank'>CrossFit Miramar</a> - San Diego, CA<br />
<a href='http://www.crossfitfortmeade.com/' target='_blank'>CrossFit Fort Meade</a> - Odenton, MD<br />
I was able to extract the link content/copy and URL, but I also need to extract the information that is between the end of </a> and the beginning of the next <a>, whatever is right before the <br />. For example, in the first line I need to extract "Quantico, VA".
Here's the part of my code that extracts what I have so far (once I get the page object, I'll loop through each line of the HTML source to extract all of the data I need):
page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm"))
if page.text != ""
  ## Get the URL and Name
  if page.css("a")[i] != nil
    name = page.css("a")[i].text
  else
    name = 'NA'
  end
  if page.css("a")[i] != nil
    url = page.css("a")[i]["href"]
  else
    url = 'NA'
  end
end
Read through the Nokogiri::XML::Node and Nokogiri::XML::NodeSet documentation. The methods described there make it possible to navigate the document and extract nodes:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico, VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock, NC<br />
</body>
</html>
EOT
data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href,
    text: link_text,
    trailing_text: trailing_text
  }
}
data will contain:
data
# => [{:href=>"http://crossfitquantico.blogspot.com/",
# :text=>"CrossFit Quantico",
# :trailing_text=>" - Quantico,\u00A0VA"},
# {:href=>"http://www.crossfitcherrypoint.com",
# :text=>"CrossFit Cherry Point",
# :trailing_text=>" - Havelock,\u00A0NC"}]
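The trailing_text values above still carry the leading " - " separator and a non-breaking space (\u00A0), while the question asks for plain "Quantico, VA". A minimal sketch of one way to tidy them, assuming the separator is always a hyphen; clean_location is a hypothetical helper name, not part of Nokogiri:

```ruby
# Hypothetical helper (not a Nokogiri method): tidy the raw text that follows a link.
# Assumes the location is introduced by a " - " separator, as in the sample HTML.
def clean_location(raw)
  text = raw.tr("\u00A0", ' ')        # turn non-breaking spaces into ordinary spaces
  text = text.sub(/\A\s*-\s*/, '')    # drop the leading " - " separator, if present
  text = text.strip                   # trim leftover whitespace and newlines
  text.empty? ? 'NA' : text           # fall back to 'NA' when nothing is left
end

clean_location(" - Quantico,\u00A0VA") # => "Quantico, VA"
clean_location(" \n ")                 # => "NA"
```

Inside the map block this could be applied as clean_location(link.next_sibling.text).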
Don't do this:
page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm"))
if page.text != ""
  ## Get the URL and Name
  if page.css("a")[i] != nil
    name = page.css("a")[i].text
  else
    name = 'NA'
  end
  if page.css("a")[i] != nil
    url = page.css("a")[i]["href"]
  else
    url = 'NA'
  end
end
if page.text != "" doesn't really tell you what you want to know, which is whether there are links. Simply searching the document will tell you that.
You're searching the DOM for links each time you use page.css("a"), which wastes CPU. Testing page.css("a")[i] != nil is a waste too. If you iterate correctly over a syntactically correct document containing links, you'll never hit a case where a link can't be found, because search or its act-alikes will have handed the links to you.
Here's a minor tweak to the above code to provide "NA" values:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico, VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock, NC<br />
<a ></a>
</body>
</html>
EOT
doc.search('a').class # => Nokogiri::XML::NodeSet
doc.search('a').size # => 3
data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href || 'NA',
    text: link_text.empty? ? 'NA' : link_text,
    trailing_text: trailing_text
  }
}
data.size # => 3
data.class # => Array
data.first.class # => Hash
data
# => [{:href=>"http://crossfitquantico.blogspot.com/",
# :text=>"CrossFit Quantico",
# :trailing_text=>" - Quantico,\u00A0VA"},
# {:href=>"http://www.crossfitcherrypoint.com",
# :text=>"CrossFit Cherry Point",
# :trailing_text=>" - Havelock,\u00A0NC"},
# {:href=>"NA", :text=>"NA", :trailing_text=>" \n "}]