How to get links using Mechanize and Nokogiri in Ruby

Question

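Given the example below, can anyone show me how to use Nokogiri and Mechanize to split the links under each <h4> tag into separate groups, i.e. all the links under:

"Some text"
"Some more text"
"Some additional text"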

<div id="right_holder">
    <h3><a href="#"><img src="http://example.com" width="11" height="11"></a></h3>
    <br />
    <br />
    <h4><a href="#">Some text</a></h4>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <br />
    <br />
    <h4><a href="#">Some more text</a></h4>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <br />
    <br />
    <h4><a href="#">Some additional text</a></h4>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
</div>

Answer 1

Normally, you would do this:

page.search('h4 a').each do |a|
  puts a[:href]
end

But I'm sure you have noticed that none of these links actually go anywhere.

Update:

To group them, how about some node-set math:

page.search('h4').each do |h4|
  puts h4.text
  (h4.search('~ a') - h4.search('~ h4 ~ a')).each do |a|
    puts a.text
  end
end

In other words: every a that follows this h4 as a sibling, minus those that also follow a later h4.

Answer 2

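You could iterate over the document and separate the data, as in "How to split an HTML document using Nokogiri?", but if you know what the tag will be, you can simply split on it: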

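# html is the raw HTML string; split it at every opening <h4 and collect the links in each chunk
html.split('<h4').map { |g| Nokogiri::HTML::DocumentFragment.parse(g).css('a') }

# Alternatively, walk the children of the container and start a new group at each h4
page = Nokogiri::HTML(html).css("#right_holder")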
links = page.children.inject([]) do |link_hash, child|
  if child.name == 'h4'
    name = child.text
    link_hash << { :name => name, :content => ""}
  end

  next link_hash if link_hash.empty?
  link_hash.last[:content] << child.to_xhtml
  link_hash
end

grouped_hsh = links.inject({}) do |hsh, link|
  hsh[link[:name]] = Nokogiri::HTML::DocumentFragment.parse(link[:content]).css('a')
  hsh
end

# {"Some text"=>[#<Nokogiri::XML::Element:0x3ff4860d6c30,
#  "Some more text"=>[#<Nokogiri::XML::Element:0x3ff486096c20...,
#  "Some additional text"=>[#<Nokogiri::XML::Element:0x3ff486f2de78...}
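
The child-walking idea above can also be written as a single pass with `slice_before`. This is only a minimal sketch: `group_by_heading` and `FakeNode` are hypothetical names, and the Struct stands in for parsed elements so the snippet runs without the gem. With Nokogiri you would presumably pass something like `doc.at_css('#right_holder').element_children` instead, since those nodes also respond to `name`.

```ruby
# Stand-in for a parsed element (assumption: real nodes respond to #name too).
FakeNode = Struct.new(:name, :label)

# Hypothetical helper: split a flat list of sibling nodes into
# [heading, links] pairs, starting a new group at every heading.
def group_by_heading(nodes, heading: 'h4', link: 'a')
  nodes
    .drop_while { |n| n.name != heading }    # skip anything before the first heading
    .slice_before { |n| n.name == heading }  # begin a new chunk at each heading
    .map { |head, *rest| [head, rest.select { |n| n.name == link }] }
end

siblings = [
  FakeNode.new('h3', 'logo'),
  FakeNode.new('h4', 'Some text'),
  FakeNode.new('a', 'link 1'),
  FakeNode.new('a', 'link 2'),
  FakeNode.new('br', nil),
  FakeNode.new('h4', 'Some more text'),
  FakeNode.new('a', 'link 3')
]

groups = group_by_heading(siblings)
groups.each do |head, links|
  puts "#{head.label}: #{links.map(&:label).join(', ')}"
end
```

Unlike the `inject` version, this sketch only collects the sibling links that follow each heading. The link nested inside the h4 itself is not part of the group.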


Answer 1
2 2015-04-17 22:53:06

Answer 2
1 Accepted 2015-04-17 22:05:09
