.NET Remove/Strip JavaScript and CSS code blocks from HTML page
I have an HTML string with JavaScript and CSS code blocks:
<script type="text/javascript">
alert('hello world');
</script>
<style type="text/css">
A:link {text-decoration: none}
A:visited {text-decoration: none}
A:active {text-decoration: none}
A:hover {text-decoration: underline; color: red;}
</style>
How do I strip those blocks? Any suggestions about regular expressions that can be used to remove them?
The quick 'n' dirty method would be a regex like this:
var regex = new Regex(
    "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
    RegexOptions.Singleline | RegexOptions.IgnoreCase
);
string output = regex.Replace(input, "");
The better* (but possibly slower) option would be to use HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlInput);
// SelectNodes returns null (not an empty collection) when nothing matches
var nodes = doc.DocumentNode.SelectNodes("//script|//style");
if (nodes != null)
{
    foreach (var node in nodes)
        node.ParentNode.RemoveChild(node);
}
string htmlOutput = doc.DocumentNode.OuterHtml;
*) For a discussion about why it's better, see this thread.
Use HTMLAgilityPack for better results, or try this function:
public string RemoveScriptAndStyle(string HTML)
{
    string Pat = "<(script|style)\\b[^>]*?>.*?</\\1>";
    return Regex.Replace(HTML, Pat, "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
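One gap in the pattern above: HTML allows whitespace between the tag name and the `>` of an end tag, so a block closed with `</script >` would not be matched. Below is a minimal sketch of a variant that tolerates this. The class name `HtmlBlockStripper` is made up for illustration, and this is still a regex heuristic rather than a parser.

```csharp
using System.Text.RegularExpressions;

public static class HtmlBlockStripper // hypothetical name, not from the answer
{
    // Same idea as the pattern above, plus \s* so that "</script >" also closes the block.
    // With IgnoreCase, the backreference \1 also matches case-insensitively.
    private static readonly Regex Blocks = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string RemoveScriptAndStyle(string html)
    {
        return Blocks.Replace(html, string.Empty);
    }
}
```

Calling `HtmlBlockStripper.RemoveScriptAndStyle(html)` on the string from the question should leave only the markup outside the two blocks.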
I reinvented the wheel here. It may not be as correct as HtmlAgilityPack, but it is about 5-6 times faster on a 400 KB page. It also converts characters to lowercase and removes digits (it was written for a tokenizer):
private static readonly List<byte[]> SPECIAL_TAGS = new List<byte[]>
{
    Encoding.ASCII.GetBytes("script"),
    Encoding.ASCII.GetBytes("style"),
    Encoding.ASCII.GetBytes("noscript")
};

private static readonly List<byte[]> SPECIAL_TAGS_CLOSE = new List<byte[]>
{
    Encoding.ASCII.GetBytes("/script"),
    Encoding.ASCII.GetBytes("/style"),
    Encoding.ASCII.GetBytes("/noscript")
};

// Strips all tags, the contents of script/style/noscript blocks, and digits.
// Note: CheckSpecialTags is called below but was not included in the post.
public static string StripTagsCharArray(string source, bool toLowerCase)
{
    var array = new char[source.Length];
    var arrayIndex = 0;
    var inside = false;            // currently between '<' and '>'
    var haveSpecialTags = false;   // currently inside a script/style/noscript block
    var compareIndex = -1;         // position of the current char within the tag
    var singleQuoteMode = false;
    var doubleQuoteMode = false;
    var matchMemory = SetDefaultMemory(SPECIAL_TAGS);
    for (int i = 0; i < source.Length; i++)
    {
        var let = source[i];
        if (inside && !singleQuoteMode && !doubleQuoteMode)
        {
            compareIndex++;
            if (haveSpecialTags)
            {
                var endTag = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS_CLOSE, ref matchMemory);
                if (endTag) haveSpecialTags = false;
            }
            if (!haveSpecialTags)
            {
                haveSpecialTags = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS, ref matchMemory);
            }
        }
        if (haveSpecialTags && let == '"')
        {
            doubleQuoteMode = !doubleQuoteMode;
        }
        if (haveSpecialTags && let == '\'')
        {
            singleQuoteMode = !singleQuoteMode;
        }
        if (let == '<')
        {
            // A new tag starts: reset the matching state.
            matchMemory = SetDefaultMemory(SPECIAL_TAGS);
            compareIndex = -1;
            inside = true;
            continue;
        }
        if (let == '>')
        {
            inside = false;
            continue;
        }
        if (inside) continue;
        if (char.IsDigit(let)) continue;
        if (haveSpecialTags) continue;
        array[arrayIndex] = toLowerCase ? Char.ToLowerInvariant(let) : let;
        arrayIndex++;
    }
    return new string(array, 0, arrayIndex);
}

// Marks every tag as "still a candidate" before matching begins.
private static bool[] SetDefaultMemory(List<byte[]> specialTags)
{
    var memory = new bool[specialTags.Count];
    for (int i = 0; i < memory.Length; i++)
    {
        memory[i] = true;
    }
    return memory;
}
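The `CheckSpecialTags` helper is called in the code above but is not shown in the post. The following is only a guess at what it might look like, inferred from how it is called: it compares the current character against position `compareIndex` of each candidate tag name, clears the candidate's flag in `matchMemory` on a mismatch, and returns true once a whole name has matched. The class name `TagMatchSketch`, the `public` modifier, and the case-insensitive comparison are assumptions. In the original the helper would sit in the same class as `StripTagsCharArray`.

```csharp
using System.Collections.Generic;

public static class TagMatchSketch // hypothetical container, not from the answer
{
    // Hypothetical reconstruction of the missing helper.
    // compareIndex is 0-based, counted from the first character after '<'.
    // matchMemory[j] stays true only while tag j has matched every character so far.
    public static bool CheckSpecialTags(char let, int compareIndex,
        List<byte[]> specialTags, ref bool[] matchMemory)
    {
        var lower = char.ToLowerInvariant(let); // assumption: tag names match case-insensitively
        for (int j = 0; j < specialTags.Count; j++)
        {
            if (!matchMemory[j]) continue;
            var tag = specialTags[j];
            if (compareIndex >= tag.Length || tag[compareIndex] != lower)
            {
                matchMemory[j] = false; // this tag can no longer match
                continue;
            }
            if (compareIndex == tag.Length - 1) return true; // whole name matched
        }
        return false;
    }
}
```

Because this sketch only matches a prefix, a tag such as `<scripts>` would also be treated as a script block. A stricter version would check the character that follows the name.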
Similar to Elian Ebbing's answer and Rajeev's answer, I opted for the more stable solution of using an HTML library rather than regular expressions. But instead of HtmlAgilityPack I used AngleSharp, which gave me jQuery-like selectors, in .NET Core 3:
// using AngleSharp;
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(sourceHtml)); // generate HTML DOM from source HTML string
var elems = document.QuerySelectorAll("script, style"); // get script and style elements
foreach (var elem in elems)
{
    var parent = elem.Parent;
    parent.RemoveChild(elem); // remove element from DOM
}
var resultHtml = document.DocumentElement.OuterHtml; // HTML result as a string
Just look for an opening <script tag, and then remove everything between it and the closing /script> tag. Likewise for the style. See Google for string manipulation tips.
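The string-search approach described above can be sketched roughly as follows. This is an illustration only: the names `BlockRemover` and `RemoveBlocks` are hypothetical, the search is a plain case-insensitive `IndexOf`, and an unclosed block is assumed to run to the end of the input.

```csharp
using System;
using System.Text;

public static class BlockRemover // hypothetical name, not from the answer
{
    // Removes every <tagName ...> ... </tagName> block using plain string searches.
    public static string RemoveBlocks(string html, string tagName)
    {
        var open = "<" + tagName;
        var close = "</" + tagName;
        var result = new StringBuilder(html.Length);
        int pos = 0;
        while (true)
        {
            int start = html.IndexOf(open, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;
            // Keep the text that precedes the opening tag.
            result.Append(html, pos, start - pos);
            int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
            if (end < 0) { pos = html.Length; break; } // unclosed block: drop the rest
            int gt = html.IndexOf('>', end);
            pos = gt < 0 ? html.Length : gt + 1; // continue after the closing tag's '>'
        }
        // Keep whatever follows the last removed block.
        result.Append(html, pos, html.Length - pos);
        return result.ToString();
    }
}
```

Usage would be `BlockRemover.RemoveBlocks(BlockRemover.RemoveBlocks(html, "script"), "style")`. Like the regex answers, this matches on a prefix, so a tag named `<scripts>` would be treated as a script block too.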