Nokogiri parsing table with no HTML element

Question

I have this code that goes to a URL and parses 'li' elements into an array. However, I have run into a problem when trying to parse anything that is not inside a 'b' tag.

Code:

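require 'nokogiri'
require 'open-uri'
require 'csv'

url = '(some URL)'
page = Nokogiri::HTML(URI.open(url)) # URI.open is needed on Ruby 3.0+, where bare open(url) no longer fetches URLs
csv = CSV.open("/tmp/output.csv", 'w')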

page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end

HTML:

<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>

I would like to parse all of the elements so that the output looks something like this:

["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]

Answer 1

require 'nokogiri'

def main
  output = []
  page = File.open("parse.html") {|f| Nokogiri::HTML(f)}
  page.search("//li[not(@id) and not (@class)]").each do |row|
    arr = []
    result = row.text
    result.each_line { |l|
      if l.strip.length > 0
        arr << l.strip
      end
    }
    output << arr
  end
  print output
end

if __FILE__ == $PROGRAM_NAME
  main()
end

Answer 2

I ended up finding the solution to my own question, so if anyone is interested: I simply changed

row.search('b').each do |cell|

into:

row.search('text()').each do |cell|

I also changed

arr << cell.text

into:

arr << cell.text.gsub("\n", '').gsub("\r", '')

in order to remove all the \n and \r characters that were present in the output.

Answer 3

Based on your HTML I'd do it like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<ol>
<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>
<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>
</ol>
EOT

doc.search('li').map { |li|
  li.text.split("\n").map(&:strip)
}
# => [["The Company Name",
#      "The Street",
#      "The City,",
#      "The State",
#      "The Zipcode"],
#     ["The Company Name",
#      "The Street",
#      "The City,",
#      "The State",
#      "The Zipcode"]]
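
Since the question writes its rows to a CSV file, here is a hedged sketch of that last step. The rows are hard-coded in the shape shown above so the snippet runs on its own, and `cleaned` and `csv_text` are just illustrative names. It also drops the trailing comma that is left on "The City,", because otherwise the CSV writer would have to quote that cell.

```ruby
require 'csv'

# Rows in the shape returned above (hard-coded so the sketch is self-contained).
rows = [
  ["The Company Name", "The Street", "The City,", "The State", "The Zipcode"],
  ["The Company Name", "The Street", "The City,", "The State", "The Zipcode"]
]

# Remove the trailing comma left over from "The City, " in the markup.
cleaned = rows.map { |row| row.map { |cell| cell.chomp(',') } }

# CSV.generate builds the CSV text in memory. CSV.open(path, 'w') { |csv| ... }
# takes rows the same way and closes the file when the block ends.
csv_text = CSV.generate do |csv|
  cleaned.each { |row| csv << row }
end

puts csv_text
```

Using the block form matters when writing to a real file: a `CSV.open` handle that is never closed may not flush everything to disk.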

3 answers

Answer 1
1 2016-05-27 00:31:30

Answer 2
0 2016-05-27 00:03:58

Answer 3
0 2017-03-03 01:30:47
