
Parse HTML (without HTML semantics being followed) using Nokogiri

I have an HTML document containing data:

<div>
    <p class="someclass">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </p>
</div>

While parsing, I use:

div_node.children.each do |child|
  if child.node_name == 'p'
    # store it as an HTML string in the db
    store(child.to_html)
  end
end

When I check the database, I get only the outer <p> tag:

<p class="someclass">
</p>

The inner <ul> content is neither stored nor retrieved.

I know that a <p> tag cannot contain a <ul> tag, but the documents we received from the client contain this markup, and there are about 1000 such documents, so I cannot edit them manually.

Try using the Nokogiri::XML parser instead of the Nokogiri::HTML one. It shouldn't care about tag semantics, but I'm not sure how it will handle those parts of HTML5 that are not valid XML.

I ended up using the Nokogiri::XML parser to parse the HTML documents.

I had to change my script in numerous places.

Parsing code:

@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!

Changes made:

  • changed the attribute method to attr
  • chaining attr with the text method is not needed here
  • I still need to check how invalid HTML5 tags are handled, though
  • some more parsing logic changes were needed
  • node.to_html works like a charm here, so I was able to store the complete HTML in the db
