Iterate over a block of HTML with Nokogiri, regardless of element type?

Question

I'm trying to iterate over a block of HTML with Nokogiri, regardless of what the element type is.

For example, given this variable html, parsed with Nokogiri:

require 'nokogiri'

html = "<p>Some text</p><ol><li>List item 1</li><li>List item 2</li></ol><p>Last bit of text</p>"

parsed_html = Nokogiri::HTML(html)

I know I can iterate over each <p> by doing:

parsed_html.css("p").each do |p|
  puts p
end

But again, that only grabs the <p> tags and not the <ol> and its children.

I also know I can grab the <ol> as well by doing:

parsed_html.css("p, ol").each do |p|
  puts p
end

But how can I iterate over all the elements without explicitly stating which ones I want to iterate over?

For example, given another html block:

html = "<p>text 1</p><ol><li>item 1</li><li>item 2</li></ol><ul><li>item 1</li></ul><h2>header</h2>"

How can I return something like:

<p>text 1</p>
<ol><li>item 1</li><li>item 2</li></ol>
<ul><li>item 1</li></ul>
<h2>header</h2>

Thanks in advance.

Answer 1

Use the CSS child selector:

parsed_html.css('body > *')

This selects only the direct children of <body>, that is, the top-level elements of your snippet.

irb(main):015:0> parsed_html = Nokogiri::HTML(html)
irb(main):016:0> parsed_html.css('body > *')
=> [#<Nokogiri::XML::Element:0x3c00 name="p" children=[#<Nokogiri::XML::Text:0x3bec "text 1">]>, #<Nokogiri::XML::Element:0x3c64 name="ol" children=[#<Nokogiri::XML::Element:0x3c28 name="li" children=[#<Nokogiri::XML::Text:0x3c14 "item 1">]>, #<Nokogiri::XML::Element:0x3c50 name="li" children=[#<Nokogiri::XML::Text:0x3c3c "item 2">]>]>, #<Nokogiri::XML::Element:0x3ca0 name="ul" children=[#<Nokogiri::XML::Element:0x3c8c name="li" children=[#<Nokogiri::XML::Text:0x3c78 "item 1">]>]>, #<Nokogiri::XML::Element:0x3cc8 name="h2" children=[#<Nokogiri::XML::Text:0x3cb4 "header">]>]
irb(main):017:0> parsed_html.css('body > *').map {|e| e.name }
=> ["p", "ol", "ul", "h2"]

This works because Nokogiri::HTML wraps the input in a document skeleton (doctype, <html> and <body>):

irb(main):018:0> parsed_html.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>text 1</p>\n<ol>\n<li>item 1</li>\n<li>item 2</li>\n</ol>\n<ul><li>item 1</li></ul>\n<h2>header</h2>\n</body></html>\n"

You can also use Nokogiri::HTML.fragment instead of Nokogiri::HTML(), which parses the snippet without adding the skeleton:

frag = Nokogiri::HTML.fragment(html)
frag.children.map(&:to_html).join("\n")

Answer 2

Just answering the questions you wrote:

how can I iterate over all the elements

CSS supports the universal selector *, which matches every element at any depth, so with the first html from the question you can just write:

Nokogiri::HTML(html).css("*").map(&:name)
# => ["html", "body", "p", "ol", "li", "li", "p"]

given "this html" how do I return "something like"

html = "<p>text 1</p><ol><li>item 1</li><li>item 2</li></ol><ul><li>item 1</li></ul><h2>header</h2>"

puts Nokogiri::HTML(html).css('body').inner_html
# <p>text 1</p>
# <ol>
# <li>item 1</li>
# <li>item 2</li>
# </ol>
# <ul><li>item 1</li></ul>
# <h2>header</h2>

I want to be able to iterate over all the first level child elements (p, ol, ul, h2)

Nokogiri::HTML(html).css('body').children.map(&:name)
# => ["p", "ol", "ul", "h2"]
