Parsing HTML with Nokogiri (without following HTML semantics)

Question

I have an HTML document containing data:

<div>
    <p class="someclass">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </p>
</div>

While parsing, I use:

div_node.children.each do |child|
  if child.node_name == 'p'
    # Store it as an HTML string in the DB
    store(child.to_html)
  end
end

When I check the database, I get only the outer <p> tag:

<p class="someclass">
</p>

None of the inner <ul> content is stored or retrieved.

I know that a <p> tag cannot contain a <ul> tag, but the documents we got from the client are written this way, and there are about 1000 of them, so I cannot edit them manually.

Answer 1

Try using the Nokogiri::XML parser instead of the Nokogiri::HTML one. It shouldn't care about tag semantics, but I'm not sure how it will handle the parts of HTML5 that are not valid XML.

Answer 2

I ended up using the Nokogiri::XML parser to parse the HTML documents.

I had to change my script in numerous places.

Parsing code

@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!

Changes made

Changed the attribute method to attr
Chaining attr with the text method is not needed here
Still need to check how invalid HTML5 tags are handled, though
Some more parsing logic changes were needed
node.to_html works like a charm here, so I was able to store the complete HTML in the db

2 answers

Answer 1: 1, 2015-11-19 13:56:31
Answer 2: 1, accepted, 2015-11-20 09:37:01
