Nokogiri parsing table with no html element
I have this code that attempts to go to a URL and parse 'li' elements into an array. However, I have run into a problem when trying to parse anything that is not in a 'b' tag.
Code:
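require 'nokogiri'
require 'open-uri'
require 'csv'

url = '(some URL)'
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3+.
page = Nokogiri::HTML(URI.open(url))
csv = CSV.open("/tmp/output.csv", 'w')
page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  # Only the <b> children are collected here, so text outside <b> is missed.
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end
csv.close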
HTML:
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
I would like to parse all of the elements so that the output would be something like this:
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]
require 'nokogiri'

def main
  output = []
  page = File.open("parse.html") { |f| Nokogiri::HTML(f) }
  page.search("//li[not(@id) and not(@class)]").each do |row|
    arr = []
    # Keep every non-blank line of the <li> text, stripped of whitespace.
    row.text.each_line do |l|
      arr << l.strip unless l.strip.empty?
    end
    output << arr
  end
  print output
end

if __FILE__ == $PROGRAM_NAME
  main
end
I ended up finding the solution to my own question, so if anyone is interested, I simply changed
row.search('b').each do |cell|
into:
row.search('text()').each do |cell|
I also changed
arr << cell.text
into:
arr << cell.text.gsub("\n", '').gsub("\r", '')
in order to remove all the \n and \r characters that were present in the output.
Based on your HTML I'd do it like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ol>
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
</ol>
EOT
doc.search('li').map { |li|
  li.text.split("\n").map(&:strip)
}
# => [["The Company Name",
# "The Street",
# "The City,",
# "The State",
# "The Zipcode"],
# ["The Company Name",
# "The Street",
# "The City,",
# "The State",
# "The Zipcode"]]
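Note that "The City," still carries its trailing comma, while the output asked for in the question has "The City". A small follow-up sketch, assuming rows shaped like the result above (hard-coded here so it runs on its own), strips that comma and builds the CSV text. `CSV.generate` is used only to keep the example free of file I/O; `CSV.open("/tmp/output.csv", 'w')` from the question would work the same way.

```ruby
require 'csv'

# Rows shaped like the output above (hard-coded so the sketch is self-contained).
rows = [
  ["The Company Name", "The Street", "The City,", "The State", "The Zipcode"],
  ["The Company Name", "The Street", "The City,", "The State", "The Zipcode"]
]

# Drop a single trailing comma from each cell, e.g. "The City," -> "The City".
cleaned = rows.map { |row| row.map { |cell| cell.sub(/,\z/, '') } }

# Build the CSV text in memory, one line per <li>.
csv_text = CSV.generate do |csv|
  cleaned.each { |row| csv << row }
end

puts csv_text
```

Only a comma at the very end of a cell is removed, so commas inside a value (for example in a company name) are left alone and will be quoted by the CSV writer.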