Removing script tags with Regex

Question

I'm trying to use a regular expression I found on this site, and it doesn't seem to work. Any ideas?

Input string:

sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

Regular expression:

sFetch = Regex.Replace(sFetch, "<script.*?>.*?</script>", "", RegexOptions.IgnoreCase);

Answer 1

Add RegexOptions.Singleline:

RegexOptions.IgnoreCase | RegexOptions.Singleline

Even so, this will never work on input like the following:

<script
>
alert(1)
</script
/**/
>

So look for an HTML parser such as HTML Agility Pack.
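A minimal sketch of both points in this answer, using only System.Text.RegularExpressions. The class name SinglelineDemo and the helper Strip are made up for illustration; the pattern and the input string are the ones from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Pattern from the question, unchanged.
    private const string Pattern = "<script.*?>.*?</script>";

    // Hypothetical helper: applies the question's pattern with the given options.
    public static string Strip(string html, RegexOptions options)
    {
        return Regex.Replace(html, Pattern, "", options);
    }

    public static void Main()
    {
        // Input string from the question.
        string sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx_section(){}function utmx(){}\n\t\t(function()})();\n\t</script>456";

        // Without Singleline, '.' does not match '\n', so nothing is removed.
        Console.WriteLine(Strip(sFetch, RegexOptions.IgnoreCase) == sFetch); // True

        // With Singleline, '.' also matches '\n' and the whole block is removed.
        Console.WriteLine(Strip(sFetch, RegexOptions.IgnoreCase | RegexOptions.Singleline)); // 123456

        // The literal text "</script>" never appears here, so the pattern
        // cannot match even with Singleline and the input is left untouched.
        string tricky = "<script\n>\nalert(1)\n</script\n/**/\n>";
        Console.WriteLine(Strip(tricky, RegexOptions.IgnoreCase | RegexOptions.Singleline) == tricky); // True
    }
}
```

The last case is why a parser is the safer choice: the regex only recognizes one exact spelling of the closing tag.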

Answer 2

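The regex fails because your input contains newlines, and the metacharacter . does not match them.

To fix this, you can use the RegexOptions.Singleline option as S.Mark says, or you can change the regex to: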

"<script[\d\D]*?>[\d\D]*?</script>"

which uses [\d\D] instead of the . metacharacter.

\d is any digit and \D is any non-digit, so [\d\D] is a digit or a non-digit, which is effectively any character.
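As a rough sketch, the pattern above can be tried without RegexOptions.Singleline. The class name DigitOrNonDigitDemo and the helper Strip are hypothetical, the input is a shortened version of the one in the question, and the verbatim string literal (@"...") is my assumption about how to write the pattern in C# so the backslashes reach the regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitOrNonDigitDemo
{
    // Verbatim string, so \d and \D are passed to the regex engine as written.
    private const string Pattern = @"<script[\d\D]*?>[\d\D]*?</script>";

    // Hypothetical helper: removes script blocks without using Singleline.
    public static string Strip(string html)
    {
        // [\d\D] matches any character, including '\n'.
        return Regex.Replace(html, Pattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string sFetch = "123<script type=\"text/javascript\">\n\t\tfunction utmx(){}\n\t</script>456";
        Console.WriteLine(Strip(sFetch)); // 123456
    }
}
```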

Answer 3

If you really want to sanitize an HTML string (and you are using .NET), take a look at the Microsoft Web Protection Library:

Sanitizer.GetSafeHtmlFragment(untrustedHtml);

There is a description of it here.

Answer 4

This is a bit shorter:

"<script[^<]*</script>"

or

"<[^>]*>[^>]*>"