How to get HTML between a pair of same tags using nokogiri?

I want to scrape an HTML file like this:

<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>

I need the value of each table (shown as "data x-y" in the example) for each subtitle ("subtitle x-y") under each title ("title x").
To associate them, I want to cut out everything from each <h1> up to the last <p> before the next <h1>, but I can't figure out how to do it.
I spent 5 hours searching, reading, and experimenting by trial and error, and finally wrote the code below, but it still doesn't work.
What's wrong? How can I cut the HTML?

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://example.com/"))

doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))

    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"

    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end

Meditate on this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>
EOT

Process the doc:

div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }

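Running that results in h1_blocks containing an array of NodeSets. Because the first child of the <div> is a whitespace text node that comes before the first <h1>, h1_blocks[0] holds only that text node, and the <h1> blocks start at index 1. Here's the first set based on your HTML: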

h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle 1-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated n times)\n\n    ",
#     "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n  //(the same structure repeated m times)\n\n  "]

Here's the second set, based on your HTML:

h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle m-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated l times)\n\n    ",
#     "<h2><span>subtitle m-(l+2)</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n"]

How does this work?

Ruby's Enumerable module has slice_before, which calls a block for each element and starts a new sub-array before every element for which the block returns a true value. This is useful when we have a flat list of elements and need to break it into separate chunks.
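As a small illustration of that behavior (the array below is made-up sample data, not taken from the question), note that any elements that come before the first match end up in a chunk of their own:

```ruby
# Start a new chunk before every element equal to 1.
# The leading 0 does not match, so it forms its own first chunk.
chunks = [0, 1, 2, 1, 3].slice_before { |n| n == 1 }.to_a

p chunks
# => [[0], [1, 2], [1, 3]]
```

This is the same reason a chunk holding only leading whitespace can appear ahead of the first <h1> chunk.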

We often use it when parsing text that contains repeating blocks that need to be processed as a unit, such as paragraphs or network-device interface listings.
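A minimal sketch of that kind of text chunking, assuming a hypothetical interface listing in which each block starts with an unindented line (the lines and addresses are invented for the example):

```ruby
# Hypothetical device listing: an unindented line opens a new interface block.
lines = [
  "eth0: flags=UP",
  "  inet 192.0.2.10",
  "  mtu 1500",
  "lo: flags=UP,LOOPBACK",
  "  inet 127.0.0.1"
]

# Begin a new chunk before every line that is not indented.
blocks = lines.slice_before { |line| !line.start_with?(" ") }.to_a

blocks.each do |header, *details|
  puts "#{header} (#{details.size} detail lines)"
end
```

Each element of blocks is an array whose first entry is the header line, so the header and its details can be destructured directly.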

Once the children of the <div id="hoge"> tag have been chunked, each chunk is passed into map, which turns it back into a NodeSet, making it easy to continue treating the chunks as we normally would in Nokogiri.
