Parse HTML (without following HTML semantics) using Nokogiri
I have an HTML document containing data:
<div>
<p class="someclass">
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</p>
</div>
While parsing, I use:
div_node.children.each do |child|
  if child.node_name == 'p'
    # store it as an HTML string in the db
    store(child.to_html)
  end
end
When I check the database, I get only the outer <p> tag:
<p class="someclass">
</p>
None of the inner <ul> tag content is stored or retrieved.
I know that a <p> tag cannot contain a <ul> tag, but the documents we got from the client are structured this way, and there are about 1000 of them, so I cannot edit them manually.
Try using the Nokogiri::XML parser instead of the Nokogiri::HTML one. It shouldn't care about tag semantics, but I'm not sure how it will handle the parts of HTML5 that are not valid XML.
I ended up using the Nokogiri::XML parser to parse the HTML documents.
I had to change my script in numerous places.
Parsing code
@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!
Changes done
Changed the attribute method to attr.
Chaining attr with the text method is not needed here.
node.to_html works like a charm here, so I was able to store the complete HTML in the db.