
Removing all script tags from html with JS Regular Expression

I want to strip script tags out of this HTML at Pastebin:

http://pastebin.com/mdxygM0a

I tried using the following regular expression:

html.replace(/<script.*>.*<\/script>/ims, " ")

But it does not remove all of the script tags in the HTML. It only removes in-line scripts. I'm looking for a regex that can remove all of the script tags (in-line and multi-line). It would be highly appreciated if a test is carried out on my sample: http://pastebin.com/mdxygM0a

jQuery uses a regex to remove script tags in some cases and I'm pretty sure its devs had a damn good reason to do so. Probably some browser does execute scripts when inserting them using innerHTML.

Here's the regex:

/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi

And before people start crying "but regexes for HTML are evil": Yes, they are, but for script tags they are safe because of the special behaviour: a <script> section may not contain </script> at all unless it should end at this position. So matching it with a regex is easily possible. However, from a quick look, the regex above does not account for trailing whitespace inside the closing tag, so you'd have to test whether a closing tag such as </script > will still work.
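To illustrate the answer above, here is a minimal sketch that applies the quoted regex to a small sample containing one in-line and one multi-line script. The sample markup and the names `sampleHtml` and `cleaned` are made up for illustration; this is not the Pastebin sample from the question.

```javascript
// The regex quoted above, copied verbatim.
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical sample: one in-line script and one spanning several lines.
const sampleHtml = [
  '<p>before</p>',
  '<script>var a = 1;</script>',
  '<script type="text/javascript">',
  '  if (a < 2) {',
  '    document.title = "x";',
  '  }',
  '</script>',
  '<p>after</p>'
].join('\n');

// Both script elements are removed; the surrounding paragraphs are kept.
const cleaned = sampleHtml.replace(SCRIPT_REGEX, '');
console.log(cleaned);
```

Because the pattern uses `[^<]*` rather than `.`, it crosses line breaks without needing the `s` flag, which is why it copes with the multi-line script.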

Attempting to remove HTML markup using a regular expression is problematic. You don't know what's in there as script or attribute values. One way is to insert it as the innerHTML of a div, remove any script elements and return the innerHTML, e.g.

  function stripScripts(s) {
    var div = document.createElement('div');
    div.innerHTML = s;
    var scripts = div.getElementsByTagName('script');
    var i = scripts.length;
    while (i--) {
      scripts[i].parentNode.removeChild(scripts[i]);
    }
    return div.innerHTML;
  }

alert(
 stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
);

Note that at present, browsers will not execute the script if inserted using the innerHTML property, and likely never will, especially as the element is not added to the document.

Regexes are beatable, but if you have a string version of HTML that you don't want to inject into a DOM, they may be the best approach. You may want to put it in a loop to handle something like:

<scr<script>Ha!</script>ipt> alert(document.cookie);</script>

Here's what I did, using the jQuery regex from above:

var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
while (SCRIPT_REGEX.test(text)) {
    text = text.replace(SCRIPT_REGEX, "");
}
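A small sketch of why the loop matters, using the nested example above. The helper name `stripScriptsRepeatedly` is hypothetical; it re-applies the replacement until the text stops changing, which is one way to write the same idea without relying on `test()`.

```javascript
// The jQuery regex quoted earlier in the thread.
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: keep replacing until a pass makes no further change.
function stripScriptsRepeatedly(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(SCRIPT_REGEX, '');
  } while (text !== previous);
  return text;
}

const nested = '<scr<script>Ha!</script>ipt> alert(document.cookie);</script>';

// A single pass removes the inner element and reassembles a complete outer one.
const onePass = nested.replace(SCRIPT_REGEX, '');
console.log(onePass);

// Repeating the pass removes the reassembled element as well.
const fullyStripped = stripScriptsRepeatedly(nested);
console.log(fullyStripped);
```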

This regex should work too:

<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>

It even allows "problematic" strings like these inside the script:

<script type="text/javascript">
   var test1 = "</script>";
   var test2 = '\'</script>';
   var test1 = "\"</script>";
   var test1 = "<script>\"";
   var test2 = '<scr\'ipt>';
   /* </script> */
   // </script>
   /* ' */
   // var foo=" '
</script>

It seems that jQuery and Prototype fail on these...

Edit July 31 '17: Added a) non-capturing groups for better performance (and no empty groups) and b) support for JavaScript comments.
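As a rough check of that claim, the sketch below runs this regex and the jQuery regex quoted earlier over a one-line script whose string literal contains </script>. The `g` flag, the sample input, and the variable names are assumptions added here for illustration.

```javascript
// The string-aware regex from this answer; the "g" flag is an assumption added here.
const STRING_AWARE_REGEX = /<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>/g;
// The jQuery regex quoted earlier in the thread, for comparison.
const JQUERY_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical input: the closing tag also appears inside a string literal.
const tricky = '<p>keep</p><script>var t = "</script>";</script>';

// The string-aware regex skips the quoted "</script>" and removes the whole element.
const viaStringAware = tricky.replace(STRING_AWARE_REGEX, '');
// The jQuery regex stops at the first "</script>" and leaves the tail behind.
const viaJquery = tricky.replace(JQUERY_REGEX, '');

console.log(viaStringAware);
console.log(viaJquery);
```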

Whenever you have to resort to regex-based script tag cleanup, at least allow white-space in the closing tag, in the form of

</script\s*>

Otherwise things like

<script>alert(666)</script   >

would remain, since trailing spaces after tag names are valid.
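A minimal sketch of the difference, assuming the jQuery regex quoted earlier in the thread as the starting point. The `LENIENT_REGEX` variant is an adaptation written for this illustration, not something taken from jQuery itself.

```javascript
// The jQuery regex from earlier in the thread; it requires exactly "</script>".
const STRICT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
// Adapted variant that tolerates whitespace before the final ">" of the closing tag.
const LENIENT_REGEX = /<script\b[^<]*(?:(?!<\/script\s*>)<[^<]*)*<\/script\s*>/gi;

const spaced = '<script>alert(666)</script   >';

// The strict pattern finds no match, so the input comes back unchanged.
const strictResult = spaced.replace(STRICT_REGEX, '');
// The lenient pattern matches and removes the whole element.
const lenientResult = spaced.replace(LENIENT_REGEX, '');

console.log(JSON.stringify(strictResult));
console.log(JSON.stringify(lenientResult));
```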

Why not use jQuery.parseHTML() (http://api.jquery.com/jquery.parsehtml/)?

If you want to remove all JavaScript code from some HTML text, then removing <script> tags isn't enough, because JavaScript can still live in "onclick", "onerror", "href" and other attributes.

Try out this npm module, which handles all of this: https://www.npmjs.com/package/strip-js
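To make the point concrete, here is a small sketch showing that a script-tag regex (the jQuery one quoted earlier in the thread) leaves attribute-based JavaScript untouched. The sample markup is hypothetical, and the sketch does not use or depict the npm module's API.

```javascript
// The jQuery regex quoted earlier in the thread.
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical markup with JavaScript in an event handler, a javascript: URL, and a script element.
const markup =
  '<img src="x" onerror="alert(1)">' +
  '<a href="javascript:alert(2)">link</a>' +
  '<script>alert(3)</script>';

// Only the <script> element is removed; the onerror handler and javascript: URL survive.
const withoutScripts = markup.replace(SCRIPT_REGEX, '');
console.log(withoutScripts);
```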

In my case, I needed to parse out the page title and still have all the other goodness of jQuery, minus it firing scripts. Here is my solution, which seems to work.

        $.get('/somepage.htm', function (data) {
            // excluded code to extract title for simplicity
            var bodySI = data.indexOf('<body>') + '<body>'.length,
                bodyEI = data.indexOf('</body>'),
                body = data.substr(bodySI, bodyEI - bodySI),
                $body;

            body = body.replace(/<script[^>]*>/gi, ' <!-- ');
            body = body.replace(/<\/script>/gi, ' --> ');

            //console.log(body);

            $body = $('<div>').html(body);
            console.log($body.html());
        });

This kind of shortcut sidesteps the worry about scripts: instead of trying to remove the script tags and their content, you replace the tags with comment delimiters, which renders the scripts useless because each script body ends up inside an HTML comment.

Let me know if that still presents a problem, as it will help me too.
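One case that may still present a problem is a script body that itself contains `-->`, since an HTML comment ends at the first `-->`. The sketch below isolates the two replacements in a hypothetical helper, `commentOutScripts`, and shows both a simple input and one with an early `-->`; the sample strings are made up for illustration.

```javascript
// Hypothetical helper wrapping the two replacements from the answer above.
function commentOutScripts(html) {
  return html
    .replace(/<script[^>]*>/gi, ' <!-- ')
    .replace(/<\/script>/gi, ' --> ');
}

// Simple case: the script body ends up inside a single comment.
const simple = commentOutScripts('<p>a</p><script>alert(1)</script>');
console.log(simple);

// Risky case: the body contains "-->", so a parser would close the comment early
// and treat the rest of the script body as ordinary page content.
const risky = commentOutScripts('<script>while (n --> 0) {}</script><p>b</p>');
console.log(risky);
```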

You can do this without a regular expression. Simply cast your HTML string to an HTML node using document.createElement(), find all scripts with element.getElementsByTagName('script'), and then just remove() them!

Fun fact: SO's demo does not like it when you create an element with a <script> tag. The snippet below will not run there, but it does work in the full working demo at JSBin.com.

var el = document.createElement( 'html' );
el.innerHTML = "<p>Valid paragraph.</p><p>Another valid paragraph.</p><script>Dangerous scripting.;.</script><p>Last final paragraph;</p>";

var scripts = el.getElementsByTagName( 'script' ); // Live HTMLCollection of the script elements

// Iterate backwards: the collection is live, so removing an element shifts the ones after it.
for(var i = scripts.length - 1; i >= 0; i--) {
  scripts[i].remove();
}

console.log(el.innerHTML);

This is a much cleaner solution than a regex, imho.

Here are a variety of shell scripts you can use to strip out different elements.

# doctype
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<\!DOCTYPE\s\+html[^>]*>/<\!DOCTYPE html>/gi" {} \;

# meta charset
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<meta[^>]*content=[\"'][^\"']*utf-8[\"'][^>]*>/<meta charset=\"utf-8\">/gi" {} \;

# script text/javascript
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<script[^>]*\)\(\stype=[\"']text\/javascript[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# style text/css
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<style[^>]*\)\(\stype=[\"']text\/css[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# html xmlns
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxmlns=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# html xml:lang
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxml:lang=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

/(?:(?!</s\w)<[^<] ) </s\w*/gi; - Removes any sequence in any combination with

Try this:

var text = text.replace(/<script[^>]*>(?:(?!<\/script>)[^])*<\/script>/g, "")

Don't use regex to parse HTML.

Consider the following string:

var str = "<script>var false_closing_tag = '</script>';</script>";
var stripped = str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
console.log(stripped); // Logs: ';</script>

The current top-voted regex answer will fail to fully remove this. (Try it.) I can't even run that in the SO editor or JSFiddle because both of them use insufficient means to parse the code before running it.

And the other option, which involves adding it to a <div> element and then pulling the innerText of the div, has negative side effects as well: it will actually run the code (which is a security concern) and it will remove ALL HTML, not just script tags.

The Solution

You need to actually parse the text:

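function stripScriptTags(str){
  if(typeof str !== 'string') {
    return false;
  }
  var opened_quote_type = null;
  var in_script_tag = false;
  var string_buffer = [];
  for (let i = 0; i < str.length; i++) {
    // Opening tag: compare only the next 7 characters, case-insensitively.
    if(!in_script_tag && str.substring(i, i+7).toUpperCase() === '<SCRIPT'){
      in_script_tag = true;
      opened_quote_type = null;
      i += 6; // the loop's i++ steps past the last character of '<SCRIPT'
      continue;
    }
    if(in_script_tag){
      // Track quotes only inside a script, so an apostrophe in ordinary text
      // cannot hide a later closing tag. Escaped quotes are not handled.
      if(opened_quote_type === null && ["'", '"', '`'].includes(str[i])){
        opened_quote_type = str[i];
      }else if(opened_quote_type === str[i]){
        opened_quote_type = null;
      }else if(opened_quote_type === null &&
               str.substring(i, i+9).toUpperCase() === '</SCRIPT>'){
        in_script_tag = false;
        i += 8; // the loop's i++ steps past the final '>'
      }
      continue; // nothing inside a script element is copied to the output
    }
    string_buffer.push(str[i]);
  }
  return string_buffer.join('');
}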

You can try

$("#your_div_id").remove();

or

$("#your_div_id").html("");
