How to get links using Mechanize and Nokogiri in Ruby
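Given the example below, can anyone show me how to use Nokogiri and Mechanize to collect the links under each <h4> tag into separate groups, i.e. all the links under each heading in: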
<div id="right_holder">
<h3><a href="#"><img src="http://example.com" width="11" height="11"></a></h3>
<br />
<br />
<h4><a href="#">Some text</a></h4>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<br />
<br />
<h4><a href="#">Some more text</a></h4>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<br />
<br />
<h4><a href="#">Some additional text</a></h4>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
<a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
</div>
Normally you would do:
page.search('h4 a').each do |a|
  puts a[:href]
end
But I'm sure you've noticed that none of those links actually go anywhere (every href in the sample is "#").
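Update:
To group them, how about some node set math:
page.search('h4').each do |h4|
  puts h4.text
  (h4.search('~ a') - h4.search('~ h4 ~ a')).each do |a|
    puts a.text
  end
end
In other words: for each h4, take every a sibling that follows it, minus the a siblings that also follow a later h4.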
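You could traverse the nodes and separate the data, as in "How to split a HTML document using Nokogiri?", but if you know what the tag will be you can simply split on it:
# html is the raw HTML string
html.split('<h4').map { |g| Nokogiri::HTML::DocumentFragment.parse(g).css('a') }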
# Walk the direct children of #right_holder, starting a new group at each <h4>
page = Nokogiri::HTML(html).css("#right_holder")
links = page.children.inject([]) do |link_hash, child|
  if child.name == 'h4'
    name = child.text
    link_hash << { :name => name, :content => "" }
  end
  # Skip anything that appears before the first <h4>
  next link_hash if link_hash.empty?
  link_hash.last[:content] << child.to_xhtml
  link_hash
end
# Re-parse each group's markup and keep only its links, keyed by heading text
grouped_hsh = links.inject({}) do |hsh, link|
  hsh[link[:name]] = Nokogiri::HTML::DocumentFragment.parse(link[:content]).css('a')
  hsh
end
# {"Some text"=>[#<Nokogiri::XML::Element:0x3ff4860d6c30,
# "Some more text"=>[#<Nokogiri::XML::Element:0x3ff486096c20...,
# "Some additional text"=>[#<Nokogiri::XML::Element:0x3ff486f2de78...}