Remove HTML from string — comments

Question

I have the following text which still contains some HTML code:

v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}


Hi There,
 
For the product team to have any chance in analysing this issue we need clarification on how to reproduce the problem.

My code at the moment is:

string replacedEmailText = Regex.Replace(emailText, @"<(.|\n)*?>", string.Empty);
string finalText = WebUtility.HtmlDecode(replacedEmailText);

How do I remove the lines at the top, such as this one?

v\:* {behavior:url(#default#VML);}

Answer 1

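For this specific example, you could use .*;}(\\r\\n|\\r|\\n)* as the pattern to match and replace every match with an empty string. (The backslashes are doubled as for a regular C# string literal; in a verbatim @"..." string, write them once.)

However, this also removes any other line that contains the sequence ;}. If that can occur in your text, make the pattern more specific to what the leftover lines actually look like: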

.*\\(#default#VML\\);}(\\r\\n|\\r|\\n)*

Explanation:

.* : matches any character except a line feed (\n), zero or more consecutive times, so it never runs past the end of the current line
\\(#default#VML\\);} : matches the literal sequence (#default#VML);}
(\\r\\n|\\r|\\n)* : matches zero or more consecutive line breaks, so the line break after each rule (and any blank lines that follow) is removed as well
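As a rough sketch of how the more specific pattern could be applied in C#: the class name VmlRuleStripper, the method name Strip and the sample input are made up for illustration and are not part of the question. The pattern is written as a verbatim string, so each backslash appears once.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named only for this example.
public static class VmlRuleStripper
{
    // One whole line ending in "(#default#VML);}" plus any line breaks after it.
    // Verbatim string (@"..."), so the regex escapes use a single backslash.
    private static readonly Regex VmlRule =
        new Regex(@".*\(#default#VML\);}(\r\n|\r|\n)*");

    public static string Strip(string text)
    {
        // Replace every matching line (and its trailing line breaks) with nothing.
        return VmlRule.Replace(text, string.Empty);
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample input modelled on the text in the question (assumed CRLF line endings).
        string emailText =
            "v\\:* {behavior:url(#default#VML);}\r\n" +
            "o\\:* {behavior:url(#default#VML);}\r\n" +
            "\r\n" +
            "Hi There,\r\n";

        Console.WriteLine(VmlRuleStripper.Strip(emailText));
    }
}
```

Because .* starts at the beginning of the line, anything that shares a line with a rule is removed along with it, so this only suits input where each rule sits on its own line.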


Answer 2

Don't try to use regular expressions to remove HTML from text; use a whitelist-based library such as https://github.com/mganss/HtmlSanitizer

2 answers

solution1
0 ACCEPTED 2019-06-06 08:25:42

solution2
0 2019-06-06 08:39:02