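Parse HTML tree with nested loops using Nokogiri
Hi, I'm new to Nokogiri and am trying to parse an HTML document with varying tree structures. Any advice on how to go about parsing it would be great. I want to capture all of the text on this page.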
<div class = "main"> Title</div>
<div class = "subTopic">
<span class = "highlight">Sub Topic</span>Stuff
</div>
<div class = "main"> Another Title</div>
<div class = "subTopic">
<span class = "highlight">Sub Topic Title I</span>Stuff<br>
<span class = "highlight">Sub Topic Title II</span>Stuff<br>
<span class = "highlight">Sub Topic Title III</span>Stuff<br>
</div>
I tried this, but it just prints out each complete array, and I'm not even sure how to get to the "Stuff" part.
content = Nokogiri::HTML(open(@url))
content.css('div.main').each do |m|
  puts m.text
  content.css('div.subTopic').each do |s|
    puts s.text
    content.css('span.highlight').each do |h|
      puts h.text
    end
  end
end
Any help would be appreciated.
Something like this will parse the document structure you've given:
Data (saved as text.txt):
<div class="main"> Title</div>
<div class="subTopic">
<span class="highlight">Sub Topic</span>Stuff
</div>
<div class = "main"> Another Title</div>
<div class = "subTopic">
<span class = "highlight">Sub Topic Title I</span>Stuff<br>
<span class = "highlight">Sub Topic Title II</span>Stuff<br>
<span class = "highlight">Sub Topic Title III</span>Stuff<br>
</div>
Code:
require 'nokogiri'
require 'pp'

# Parse the HTML saved in text.txt
content = Nokogiri::HTML(File.read('text.txt'))

# Build one hash per div.main, paired with the first div.subTopic that follows it
topics = content.css('div.main').map do |m|
  topic = {}
  topic['title'] = m.text.strip
  sub_topic = m.xpath("following-sibling::div[@class='subTopic'][1]")
  topic['highlights'] = sub_topic.css('span.highlight').map do |h|
    topic_highlight = {}
    topic_highlight['highlight'] = h.text.strip
    # The text node right after the span holds the "Stuff" part
    topic_highlight['text'] = h.xpath('following-sibling::text()[1]').text.strip
    topic_highlight
  end
  topic
end

pp topics
This will print:
[{"title"=>"Title",
  "highlights"=>[{"highlight"=>"Sub Topic", "text"=>"Stuff"}]},
 {"title"=>"Another Title",
  "highlights"=>
   [{"highlight"=>"Sub Topic Title I", "text"=>"Stuff"},
    {"highlight"=>"Sub Topic Title II", "text"=>"Stuff"},
    {"highlight"=>"Sub Topic Title III", "text"=>"Stuff"}]}]