
HTML Strip Function

There is a tough nut to crack.

I have some HTML which needs to be stripped of certain tags, attributes AND style properties.

Basically there are three different approaches to consider:

  • String Operations: Iterate through the HTML string and strip it 'manually' via string operations
  • Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
  • Using a library to strip it (e.g. HTML Agility Pack)

What I would like is to have lists for:

  • acceptedTags (e.g. SPAN, DIV, OL, LI)
  • acceptedAttributes (e.g. STYLE, SRC)
  • acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)

which I can pass to the function that strips the HTML.

Example Input:

<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>

Example Output (with parameter lists from above):

<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>

  1. The entire BODY tag is stripped (not an accepted tag).
  2. The properties margin, font-family and font-size are stripped from the DIV tag.
  3. The properties font-family and font-size are stripped from the SPAN tag.

What have I tried?

Regex seemed to be the best approach at first glance, but I couldn't get it working properly. Articles on Stackoverflow I had a look at:

...and many more.

I tried the following regex:

Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
    ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html As String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)

However, this only removes tags, not attributes or properties!

I'm definitely not looking for someone to do the whole job for me, but rather for someone who can point me in the right direction.

I'm happy with answers in either C# or VB.NET.

Definitely use a library! (See this)

With HTMLAgilityPack you can do pretty much everything you want:

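  1. Remove tags you don't want:

     // needs: using System.Linq; using HtmlAgilityPack;
     string[] allowedTags = { "SPAN", "DIV", "OL", "LI" };
     // "//*" selects element nodes only; "//node()" would also match
     // text nodes (named "#text") and strip the text itself
     foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
     {
         if (!allowedTags.Contains(node.Name.ToUpper()))
         {
             HtmlNode parent = node.ParentNode;
             // second argument true = keep the removed node's children
             parent.RemoveChild(node, true);
         }
     }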
  2. Remove attributes you don't want & remove properties:

     string[] allowedAttributes = { "STYLE", "SRC" };
     foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//node()"))
     {
         List<HtmlAttribute> attributesToRemove = new List<HtmlAttribute>();
         foreach (HtmlAttribute att in node.Attributes)
         {
             if (!allowedAttributes.Contains(att.Name.ToUpper()))
                 attributesToRemove.Add(att);
             else
             {
                 string newAttrib = string.Empty;
                 // do string manipulation based on your checking accepted properties
                 // one way would be to split the attribute.Value by a semicolon and do a
                 // String.Contains() on each one, not appending those that don't match.
                 // Maybe use a StringBuilder instead too
                 att.Value = newAttrib;
             }
         }
         // remove afterwards so the collection is not modified while iterating it
         foreach (HtmlAttribute attribute in attributesToRemove)
         {
             node.Attributes.Remove(attribute);
         }
     }
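The string manipulation left as a comment in step 2 could look roughly like the sketch below. It is only an illustration: `StyleFilter` and `FilterStyle` are hypothetical names, not part of HTML Agility Pack, and it assumes the style value is a plain semicolon-separated list of `name:value` declarations (a value that itself contains a semicolon, such as a data URL, would be split incorrectly).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilter
{
    // Hypothetical helper: keeps only the "name:value" declarations whose
    // property name appears in acceptedProperties (case-insensitive).
    public static string FilterStyle(string style, IEnumerable<string> acceptedProperties)
    {
        var accepted = new HashSet<string>(acceptedProperties, StringComparer.OrdinalIgnoreCase);

        var kept = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(declaration => declaration.Trim())
            .Where(declaration =>
            {
                // The property name is everything before the first colon.
                int colon = declaration.IndexOf(':');
                return colon > 0 && accepted.Contains(declaration.Substring(0, colon).Trim());
            });

        // Re-join with a trailing semicolon, as in the question's example output.
        return string.Concat(kept.Select(declaration => declaration + ";"));
    }
}
```

In the `else` branch above this would probably only be applied to the STYLE attribute, e.g. `att.Value = StyleFilter.FilterStyle(att.Value, acceptedProperties);`, so that other accepted attributes such as SRC keep their value.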

I would probably actually just write this myself as a multi-step process:

1) Exclude all rules for removing properties from tags that are listed as tags to be removed (the tags won't be there anyway!)

2) Walk the document, taking a copy of the document without the excluded tags (i.e. in your example, copy everything up until "< div", then wait until I see ">" before continuing to copy). If I'm in copy mode and I see "ExcludedTag=", then stop copying until I see the quotation mark.

You'll probably want to do some pre-work validation on the HTML and get the formatting consistent, etc. before running this process, to avoid broken output.

Oh, and copy in chunks, i.e. just keep the index of the copy start until you reach the copy end, then copy the whole chunk, not individual characters!
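As a rough illustration of the chunked copy for the tag-removal part only, here is a minimal sketch. `ChunkStripper` and `StripTags` are made-up names, and the code assumes well-formed input where every "<" starts a tag and no attribute value contains ">"; comments, scripts and stray "<" characters in text are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ChunkStripper
{
    // Hypothetical helper: drops every tag whose name is not in acceptedTags
    // but keeps the content between the opening and closing tag.
    public static string StripTags(string html, IEnumerable<string> acceptedTags)
    {
        var accepted = new HashSet<string>(acceptedTags, StringComparer.OrdinalIgnoreCase);
        var output = new StringBuilder(html.Length);
        int copyStart = 0;  // start of the chunk that has not been copied yet
        int position = 0;

        while (position < html.Length)
        {
            int open = html.IndexOf('<', position);
            if (open < 0) break;
            int close = html.IndexOf('>', open + 1);
            if (close < 0) break;  // unterminated tag: the rest is copied as text

            // The tag name starts after "<" or "</" and ends at whitespace, "/" or ">".
            int nameStart = open + 1;
            if (nameStart < close && html[nameStart] == '/') nameStart++;
            int nameEnd = nameStart;
            while (nameEnd < close && !char.IsWhiteSpace(html[nameEnd]) && html[nameEnd] != '/')
                nameEnd++;
            string name = html.Substring(nameStart, nameEnd - nameStart);

            if (!accepted.Contains(name))
            {
                // Copy the pending chunk in one call, then skip over the tag.
                output.Append(html, copyStart, open - copyStart);
                copyStart = close + 1;
            }
            position = close + 1;
        }

        // Copy whatever remains after the last removed tag.
        output.Append(html, copyStart, html.Length - copyStart);
        return output.ToString();
    }
}
```

Accepted tags are never copied on their own; they simply stay inside the pending chunk, so the number of `Append` calls grows with the number of removed tags rather than with the length of the document.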

Hopefully that helps as a starting point.


 