Remove HTML from string - comments

Question

I have the following text which still contains some HTML code:

v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}


Hi There,
 
For the product team to have any chance in analysing this issue we need clarification on how to reproduce the problem.

My code at the moment is:

string replacedEmailText = Regex.Replace(emailText, @"<(.|\n)*?>", string.Empty);
string finalText = WebUtility.HtmlDecode(replacedEmailText);

How do I remove the top lines containing:

v\:* {behavior:url(#default#VML);}

?

Answer 1

For this specific example, you could use .*;}(\\r\\n|\\r|\\n)* as the pattern to match, replacing each match with an empty string. The backslashes are doubled as they would be in a regular C# string literal.

However, this also deletes ordinary text whenever a line contains the sequence ;} (everything on that line up to and including it). If that can happen, make the pattern more specific to what the unwanted lines look like:

.*\\(#default#VML\\);}(\\r\\n|\\r|\\n)*

Explanation:

.* : matches any character except a newline (\n), zero or more consecutive times. In .NET, . does match a carriage return (\r).
\\(#default#VML\\);} : matches the literal sequence (#default#VML);}
(\\r\\n|\\r|\\n)* : matches line breaks (\r\n, \r or \n) zero or more consecutive times, so they are removed together with the line
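
To make the explanation concrete, here is a minimal sketch of how the second pattern could be combined with the code from the question. The class name EmailTextCleaner, the method name Clean and the sample input are assumptions made for illustration; only the two regular expressions and the WebUtility.HtmlDecode call come from the thread. The pattern is written as a verbatim string (@"..."), so the backslashes are not doubled.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original thread.
public static class EmailTextCleaner
{
    // Pattern from the answer: the text on a line up to and including
    // "(#default#VML);}", plus any line breaks that follow it.
    private static readonly Regex VmlLine =
        new Regex(@".*\(#default#VML\);}(\r\n|\r|\n)*");

    public static string Clean(string emailText)
    {
        // Step 1 (from the question): strip anything that looks like a tag.
        string withoutTags = Regex.Replace(emailText, @"<(.|\n)*?>", string.Empty);

        // Step 2: remove the leftover VML style rules.
        string withoutVml = VmlLine.Replace(withoutTags, string.Empty);

        // Step 3 (from the question): decode entities such as &amp;.
        return WebUtility.HtmlDecode(withoutVml);
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample input: a style block followed by the message body.
        string emailText =
            "<style>v\\:* {behavior:url(#default#VML);}\r\n" +
            ".shape {behavior:url(#default#VML);}\r\n" +
            "</style>\r\n" +
            "<p>Hi There,</p>";

        Console.WriteLine(EmailTextCleaner.Clean(emailText)); // prints "Hi There,"
    }
}
```

The VML rules are removed after the tags are stripped because that is the stage at which they appear as bare text in the question. Line breaks that do not directly follow a matched rule are left in place.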


Answer 2

Don't try to remove HTML from text with regular expressions; use a whitelist-based library such as https://github.com/mganss/HtmlSanitizer


Answer 1
0 | accepted | 2019-06-06 08:25:42

Answer 2
0 | 2019-06-06 08:39:02
