How do I create an outline of the HTML tag structure on the page using Nokogiri?

I am trying to use Nokogiri to create an outline of an HTML page's tag structure, which I can use as an indicator of whether the page's content has changed.

To do this, I want to strip out all the text and keep only the HTML tags, without their attributes.

The idea is to use this as a sketch of the page, one of several I use, to see whether the page has changed.

When I'm done, I want the "sketch" to look roughly like this:

<html><head></head><body><div></div><p><div></div></p></body></html>

That way it can be compared against the sketches of earlier revisions to see whether the page structure has changed.

There are plenty of examples of how to parse the DOM with Nokogiri, but how do I just list it?

Any thoughts?

Something like this would do:

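require 'nokogiri'

class Nokogiri::XML::Node
  # Concatenate the sketches of the element children, skipping text,
  # comments, and other non-element nodes.
  def to_sketch
    children.find_all(&:element?).map(&:to_sketch).join
  end
end

class Nokogiri::XML::Element
  # Wrap the children's sketch in this element's bare tag (no attributes).
  def to_sketch
    "<#{name}>#{super}</#{name}>"
  end
end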

EDIT: An example

require 'nokogiri'
require 'open-uri'
# Since Ruby 3.0, Kernel#open no longer fetches URLs; use URI.open from open-uri.
Nokogiri::HTML(URI.open('http://google.com')).to_sketch

Returns:

"<html><head><meta></meta><title></title><script></script><style></style><script></script></head><body><textarea></textarea><div><div><nobr><b></b><a></a><a></a><a></a><a></a><a></a><a></a><a><u></u></a></nobr></div><div><nobr><span></span><span></span><span><a></a></span><a></a><a></a></nobr></div><div></div><div></div></div><center><br></br><div><a><img></img></a><br></br><br></br></div><form><table><tr><td></td><td><input></input><input></input><input></input><div><input></input></div><br></br><span><span><input></input></span></span><span><span><input></input></span></span></td><td><a></a><a></a></td></tr></table></form><div><br></br><div><font><a></a><a></a><a></a></font><br></br><br></br></div></div><div></div><span><center><div><div><a></a><a></a><a></a><a></a></div></div><p><a></a></p></center></span><div></div><div><script></script></div><script></script><script></script></center></body></html>"
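
Since the question's goal is to compare a page against earlier revisions, one option is to store a digest of the sketch rather than the full string. The following is a minimal sketch of that idea using Ruby's standard `digest` library. The names `sketch_digest` and `structure_changed?` are hypothetical, and the two sketch strings are made-up stand-ins for what `to_sketch` would return.

```ruby
require 'digest'

# Hypothetical helper: reduce a sketch string to a fixed-length fingerprint.
def sketch_digest(sketch)
  Digest::SHA256.hexdigest(sketch)
end

# Hypothetical helper: true when the current sketch no longer matches
# the fingerprint stored for the previous revision.
def structure_changed?(stored_digest, current_sketch)
  sketch_digest(current_sketch) != stored_digest
end

# Made-up sketches standing in for the output of to_sketch on two revisions.
previous = '<html><head></head><body><div></div></body></html>'
current  = '<html><head></head><body><div><p></p></div></body></html>'

stored = sketch_digest(previous) # e.g. saved when the page was last fetched

puts structure_changed?(stored, previous) # false: same tag structure
puts structure_changed?(stored, current)  # true: a <p> was added
```

A digest only says that something changed, not what. Keeping the full sketch string as well would allow a diff when a mismatch is found.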
