Nokogiri parsing table with no html element

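I have this code that goes to a URL and tries to parse the 'li' elements into an array. However, I run into a problem when trying to parse anything that is not inside a 'b' tag.

Code: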

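require 'nokogiri'
require 'open-uri'
require 'csv'
require 'pp'

url = '(some URL)'
page = Nokogiri::HTML(URI.open(url))
csv = CSV.open("/tmp/output.csv", 'w')

page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  # Only <b> children are collected here, so the address lines are missed.
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end
csv.close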

HTML:

<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>

I want to parse all of the elements so that the output looks like this:

["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]

require 'nokogiri'

def main
  output = []
  page = File.open("parse.html") {|f| Nokogiri::HTML(f)}
  page.search("//li[not(@id) and not (@class)]").each do |row|
    arr = []
    # Keep every non-blank line of the <li> text, stripped of whitespace.
    row.text.each_line do |line|
      stripped = line.strip
      arr << stripped unless stripped.empty?
    end
    output << arr
  end
  print output
end

if __FILE__ == $PROGRAM_NAME
  main()
end

I eventually found the solution to my own question, so in case anyone is interested, I simply changed

row.search('b').each do |cell|

to:

row.search('text()').each do |cell|

I also changed

arr << cell.text

to:

arr << cell.text.gsub("\n", '').gsub("\r", '')

in order to remove all of the \n and \r characters that appeared in the output.

Based on your HTML, I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<ol>
<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>
<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>
</ol>
EOT

doc.search('li').map { |li|
  li.text.split("\n").map(&:strip)
}
# => [["The Company Name",
#      "The Street",
#      "The City,",
#      "The State",
#      "The Zipcode"],
#     ["The Company Name",
#      "The Street",
#      "The City,",
#      "The State",
#      "The Zipcode"]]
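
The result above still carries the comma after the city, while the question asks for a bare "The City" and for CSV output. A possible follow-up step, sketched under assumptions: the rows are hard-coded in the shape shown above so the snippet runs without Nokogiri, and `clean_fields` is a hypothetical helper name, not part of any library.

```ruby
require 'csv'

# Rows in the shape produced by the snippet above (hard-coded for illustration).
rows = [
  ["The Company Name", "The Street", "The City,", "The State", "The Zipcode"],
  ["The Company Name", "The Street", "The City,", "The State", "The Zipcode"]
]

# Hypothetical helper: strips surrounding whitespace and one trailing comma.
def clean_fields(fields)
  fields.map { |field| field.strip.delete_suffix(',') }
end

cleaned = rows.map { |fields| clean_fields(fields) }

# CSV.generate builds the CSV text in memory; with CSV.open(path, 'w')
# the rows are appended in the same way using <<.
csv_text = CSV.generate do |csv|
  cleaned.each { |fields| csv << fields }
end

puts csv_text
```

`String#delete_suffix` needs Ruby 2.5 or newer; on older versions `field.strip.chomp(',')` should behave the same for this input.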
