Parsing HTML between tags with Nokogiri
Here's what my HTML file looks like:
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico, VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock, NC<br />
<a href='http://crossfitpentagon.com/' target='_blank'>CrossFit Pentagon</a> - Washington, DC<br />
<a href='http://crossfitwtbn.blogspot.com/' target='_blank'>CrossFit WTBN</a> - Quantico, VA<br />
<a href='http://cfnewriver.blogspot.com/' target='_blank'>CrossFit New River</a> - Jacksonville, NC<br />
<a href='http://xfitmiramar.com' target='_blank'>CrossFit Miramar</a> - San Diego, CA<br />
<a href='http://www.crossfitfortmeade.com/' target='_blank'>CrossFit Fort Meade</a> - Odenton, MD<br />
I was able to extract the link text and URL, but I also need to extract the information between the closing </a> and the next <a>, that is, whatever comes right before the <br />. For example, in the first line I need to extract "Quantico, VA".
Here's the part of my code that extracts some of the information I need. Once I have the page object, I'll loop through each line of the HTML source to extract all of the data:
page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm"))
# i is the index supplied by the surrounding loop (not shown)
if page.text != ""
  ## Get the URL and Name
  if page.css("a")[i] != nil
    name = page.css("a")[i].text
  else
    name = 'NA'
  end
  if page.css("a")[i] != nil
    url = page.css("a")[i]["href"]
  else
    url = 'NA'
  end
end
Read through the Nokogiri::XML::Node and Nokogiri::XML::NodeSet documentation. The methods listed there make it possible to navigate the document and extract nodes:
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico, VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock, NC<br />
</body>
</html>
EOT

data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  # the text node that immediately follows the closing </a>
  trailing_text = link.next_sibling.text
  {
    href: link_href,
    text: link_text,
    trailing_text: trailing_text
  }
}
data will contain:
data
# => [{:href=>"http://crossfitquantico.blogspot.com/",
# :text=>"CrossFit Quantico",
# :trailing_text=>" - Quantico,\u00A0VA"},
# {:href=>"http://www.crossfitcherrypoint.com",
# :text=>"CrossFit Cherry Point",
# :trailing_text=>" - Havelock,\u00A0NC"}]
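Since the question asks for just "Quantico, VA", the raw trailing_text still needs a little cleanup. Here is a minimal sketch, assuming the separator is always " - " and that the escaped character in the output above is a non-breaking space (U+00A0). The name clean_trailing_text is a hypothetical helper for illustration, not part of Nokogiri:

```ruby
# Hypothetical helper (not a Nokogiri method): turns the raw text that
# follows a link, e.g. " - Quantico,\u00A0VA", into a plain location string.
NBSP = "\u00A0"

def clean_trailing_text(raw)
  raw.to_s
     .tr(NBSP, ' ')          # replace non-breaking spaces with regular spaces
     .sub(/\A\s*-\s*/, '')   # drop the leading " - " separator, if present
     .strip                  # trim any leftover whitespace or newlines
end

puts clean_trailing_text(" - Quantico,\u00A0VA")   # prints "Quantico, VA"
```

Inside the map block you could then store clean_trailing_text(link.next_sibling.text) instead of the raw string.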
Don't do this:
page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm"))
if page.text != ""
  ## Get the URL and Name
  if page.css("a")[i] != nil
    name = page.css("a")[i].text
  else
    name = 'NA'
  end
  if page.css("a")[i] != nil
    url = page.css("a")[i]["href"]
  else
    url = 'NA'
  end
end
if page.text != "" doesn't really tell you what you want to know, which is whether there are links. Simply searching the document will tell you that.
You're searching the DOM for links each time you use page.css("a"), which wastes CPU. Testing page.css("a")[i] != nil is a waste too. If you iterate correctly over a syntactically correct document that contains links, you'll never hit a situation where a link can't be found, because search or its act-alikes will have handed the links to you.
Here's a minor tweak to the above code to provide "NA" values:
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico, VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock, NC<br />
<a ></a>
</body>
</html>
EOT

doc.search('a').class # => Nokogiri::XML::NodeSet
doc.search('a').size # => 3

data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href || 'NA',                      # nil when the attribute is missing
    text: link_text.empty? ? 'NA' : link_text,   # empty string when the tag has no content
    trailing_text: trailing_text
  }
}
data.size # => 3
data.class # => Array
data.first.class # => Hash
data
# => [{:href=>"http://crossfitquantico.blogspot.com/",
# :text=>"CrossFit Quantico",
# :trailing_text=>" - Quantico,\u00A0VA"},
# {:href=>"http://www.crossfitcherrypoint.com",
# :text=>"CrossFit Cherry Point",
# :trailing_text=>" - Havelock,\u00A0NC"},
# {:href=>"NA", :text=>"NA", :trailing_text=>" \n "}]