
Parsing HTML between tags with Nokogiri

Here's what my HTML file looks like:

<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
<a href='http://crossfitpentagon.com/' target='_blank'>CrossFit Pentagon</a> - Washington,&nbsp;DC<br />
<a href='http://crossfitwtbn.blogspot.com/' target='_blank'>CrossFit WTBN</a> - Quantico,&nbsp;VA<br />
<a href='http://cfnewriver.blogspot.com/' target='_blank'>CrossFit New River</a> - Jacksonville,&nbsp;NC<br />
<a href='http://xfitmiramar.com' target='_blank'>CrossFit Miramar</a> - San Diego,&nbsp;CA<br />
<a href='http://www.crossfitfortmeade.com/' target='_blank'>CrossFit Fort Meade</a> - Odenton,&nbsp;MD<br />

I was able to extract the link text and URL, but I also need to extract the text between the closing </a> and the start of the next <a>, that is, whatever comes right before the <br />. For example, from the first line I need to extract "Quantico,&nbsp;VA".

Here's the part of my code that extracts some of the information I need. This is what I'm doing so far; once I have the page object, I'll loop over each line of the HTML source to extract all of the data:

page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm")) 
if page.text != ""
    ## Get the URL and Name
    if page.css("a")[i] != nil
        name = page.css("a")[i].text
    else
        name = 'NA'
    end
    if page.css("a")[i] != nil
        url = page.css("a")[i]["href"]
    else
        url = 'NA'
    end
end

Read through the Nokogiri XML::Node and XML::NodeSet documentation. The methods there make it possible to navigate the document and extract nodes. In particular, next_sibling returns the node immediately after a link, which in this HTML is the text node holding the location:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
</body>
</html>
EOT

data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href,
    text: link_text,
    trailing_text: trailing_text
  }
}

data will contain:

data 
# => [{:href=>"http://crossfitquantico.blogspot.com/",
#      :text=>"CrossFit Quantico",
#      :trailing_text=>" - Quantico,\u00A0VA"},
#     {:href=>"http://www.crossfitcherrypoint.com",
#      :text=>"CrossFit Cherry Point",
#      :trailing_text=>" - Havelock,\u00A0NC"}]
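The trailing text still carries the leading " - " separator and a non-breaking space (\u00A0, from &nbsp;). If plain text such as "Quantico, VA" is wanted, a small cleanup step might look like the sketch below. The helper name clean_location is hypothetical, and converting the non-breaking space to a regular space is an assumption about the desired output, not something the question requires:

```ruby
# Hypothetical helper, not part of the original answer.
NBSP = "\u00A0"

def clean_location(trailing_text)
  trailing_text
    .tr(NBSP, ' ')          # turn non-breaking spaces into plain spaces
    .sub(/\A\s*-\s*/, '')   # drop the leading " - " separator, if present
    .strip                  # remove any remaining surrounding whitespace
end

location = clean_location(" - Quantico,\u00A0VA")
puts location # prints "Quantico, VA"
```

The tr call runs first because Ruby's strip and \s do not treat \u00A0 as whitespace.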

Don't do this:

page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm")) 
if page.text != ""
    ## Get the URL and Name
    if page.css("a")[i] != nil
        name = page.css("a")[i].text
    else
        name = 'NA'
    end
    if page.css("a")[i] != nil
        url = page.css("a")[i]["href"]
    else
        url = 'NA'
    end
end

if page.text != "" doesn't really tell you what you want to know, which is whether there are any links. Simply searching the document will tell you that.

You're searching the DOM for links every time you call page.css("a"), which wastes CPU. Testing page.css("a")[i] != nil is wasteful too. If you correctly iterate over a syntactically correct document that contains links, you'll never be in a situation where a link can't be found, because search or one of its act-alikes will have handed the links to you.

Here's a minor tweak to the above code to provide "NA" values:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
    <a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
    <a ></a>  
  </body>
</html>
EOT

doc.search('a').class # => Nokogiri::XML::NodeSet
doc.search('a').size # => 3

data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href || 'NA',
    text: link_text.empty? ? 'NA' : link_text,
    trailing_text: trailing_text
  }
}

data.size # => 3
data.class # => Array
data.first.class # => Hash

data 
# => [{:href=>"http://crossfitquantico.blogspot.com/",
#      :text=>"CrossFit Quantico",
#      :trailing_text=>" - Quantico,\u00A0VA"},
#     {:href=>"http://www.crossfitcherrypoint.com",
#      :text=>"CrossFit Cherry Point",
#      :trailing_text=>" - Havelock,\u00A0NC"},
#     {:href=>"NA", :text=>"NA", :trailing_text=>"  \n  "}]
