Remove HTML tags and comments from a string in C#?
How do I remove everything beginning with '<' and ending with '>' from a string in C#? I know it can be done with regex, but I'm not very good with it.
The tag pattern I quickly wrote for a recent small project is this one:
string tagPattern = @"<[!--\W*?]*?[/]*?\w+.*?>";
I used it like this:
MatchCollection matches = Regex.Matches(input, tagPattern);
foreach (Match match in matches)
{
    // Remove every occurrence of the matched tag text from the input.
    input = input.Replace(match.Value, string.Empty);
}
It would likely need to be modified to correctly handle script or style tags.
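As a rough sketch of the modification mentioned above, the version below first removes comments and whole script/style blocks (including their contents), then removes the remaining tags with Regex.Replace instead of a match loop. The class name HtmlStripper, the method name Strip, and both patterns are assumptions made for illustration, not the original author's code. It does not handle '>' inside attribute values or declarations such as <!DOCTYPE>.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names and patterns are illustrative assumptions.
public static class HtmlStripper
{
    // Comments, plus <script>/<style> elements together with their contents.
    // Singleline lets '.' cross line breaks; \1 matches the same element name.
    private static readonly Regex BlockPattern = new Regex(
        @"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Any remaining opening, closing, or self-closing tag.
    private static readonly Regex TagPattern = new Regex(@"</?\w+[^>]*>");

    public static string Strip(string input)
    {
        string withoutBlocks = BlockPattern.Replace(input, string.Empty);
        return TagPattern.Replace(withoutBlocks, string.Empty);
    }
}
```

For example, HtmlStripper.Strip("<p>Hello <b>world</b></p><!-- note --><script>var x = 1 < 2;</script>") returns "Hello world". For real-world HTML, a proper parser is usually the safer choice; regex-based stripping is only reasonable for small, well-behaved inputs.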
Another option is non-regex code that works 8x faster than the regex:
public static string StripTagsCharArray(string source)
{
    // The output can never be longer than the input.
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    bool inside = false; // true while between '<' and '>'
    for (int i = 0; i < source.Length; i++)
    {
        char let = source[i];
        if (let == '<')
        {
            inside = true;
            continue;
        }
        if (let == '>')
        {
            inside = false;
            continue;
        }
        if (!inside)
        {
            array[arrayIndex] = let;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}
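One caveat with the loop above: it treats every '>' as the end of a tag, so a comment that contains '>' (for example <!-- a > b -->) leaks part of its text into the output. A minimal sketch of a comment-aware variant follows, assuming comments always use the <!-- ... --> form; the class and method names are hypothetical. Unlike the loop above, it keeps a stray '>' that appears outside any tag.

```csharp
using System;
using System.Text;

// Hypothetical variant; not part of the original answer.
public static class CommentAwareStripper
{
    public static string StripTagsAndComments(string source)
    {
        var result = new StringBuilder(source.Length);
        int i = 0;
        while (i < source.Length)
        {
            if (string.CompareOrdinal(source, i, "<!--", 0, 4) == 0)
            {
                // Skip the whole comment, including any '>' inside it.
                int end = source.IndexOf("-->", i + 4, StringComparison.Ordinal);
                if (end == -1) break; // unterminated comment: drop the rest
                i = end + 3;
            }
            else if (source[i] == '<')
            {
                // Skip an ordinary tag up to and including its closing '>'.
                int end = source.IndexOf('>', i + 1);
                if (end == -1) break; // unterminated tag: drop the rest
                i = end + 1;
            }
            else
            {
                result.Append(source[i]);
                i++;
            }
        }
        return result.ToString();
    }
}
```

With this sketch, CommentAwareStripper.StripTagsAndComments("<p>Hi<!-- a > b --> there</p>") returns "Hi there".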
A non-regex option, but it still won't parse nested tags:
public static string StripHTML(string line)
{
    int beginStrip = line.IndexOf('<');
    while (beginStrip != -1)
    {
        int endStrip = line.IndexOf('>', beginStrip + 1);
        if (endStrip == -1)
        {
            // Unclosed '<': stop rather than pass a negative count to Remove.
            break;
        }
        // Cut out everything from '<' through the matching '>'.
        line = line.Remove(beginStrip, endStrip + 1 - beginStrip);
        beginStrip = line.IndexOf('<');
    }
    return line;
}