How do I strip all HTML tags in JavaScript, with exceptions?

I've been beating my head against this regex for the longest time now and am hoping someone can help. Basically, I have a WYSIWYG field where a user can type formatted text. But of course they will copy and paste from Word, the web, etc., so I have a JS function catching the input on paste. I have a function that strips ALL of the formatting from the text, which is nice, but I'd like it to leave tags like p and br so the result isn't just a big mess.

Any regex ninjas out there? Here is what I have so far, and it works. I just need it to allow certain tags.

o.node.innerHTML=o.node.innerHTML.replace(/(<([^>]+)>)/ig,"");

The browser already has a perfectly good parsed HTML tree in o.node. Serialising the document content to HTML (using innerHTML), trying to hack it about with regex (which cannot parse HTML reliably), then re-parsing the results back into document content by setting innerHTML... is just a bit perverse really.

Instead, inspect the element and attribute nodes you already have inside o.node, removing the ones you don't want, e.g.:

filterNodes(o.node, {p: [], br: [], a: ['href']});

Defined as:

// Remove elements and attributes that do not meet a whitelist lookup of lowercase element
// name to list of lowercase attribute names.
//
function filterNodes(element, allow) {
    // Recurse into child elements
    //
    Array.fromList(element.childNodes).forEach(function(child) {
        if (child.nodeType===1) {
            filterNodes(child, allow);

            var tag= child.tagName.toLowerCase();
            if (tag in allow) {

                // Remove unwanted attributes
                //
                Array.fromList(child.attributes).forEach(function(attr) {
                    if (allow[tag].indexOf(attr.name.toLowerCase())===-1)
                       child.removeAttributeNode(attr);
                });

            } else {

                // Replace unwanted elements with their contents
                //
                while (child.firstChild)
                    element.insertBefore(child.firstChild, child);
                element.removeChild(child);
            }
        }
    });
}

// ECMAScript Fifth Edition (and JavaScript 1.6) array methods used by `filterNodes`.
// Because not all browsers have these natively yet, bodge in support if missing.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, ix /*opt*/) {
        for (var i= ix || 0, n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}

// Utility function used by filterNodes. This is really just `Array.prototype.slice()`
// except that the ECMAScript standard doesn't guarantee we're allowed to call that on
// a host object like a DOM NodeList, boo.
//
Array.fromList= function(list) {
    var array= new Array(list.length);
    for (var i= 0, n= list.length; i<n; i++)
        array[i]= list[i];
    return array;
};

First, I'm not sure if regex is the right tool for this. A user might enter invalid HTML (forget a > or put a > inside an attribute value), and a regex would fail then. I don't know, though, if a parser would be much better/more bulletproof.

Second, you have a few unnecessary parentheses in your regex.
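To illustrate the point about parentheses: the replacement string never refers to a capture group, so both groups can be dropped, and the i flag has no effect because the pattern contains no letters. The sample string below is made up for the demonstration and is not from the question.

```javascript
// Pattern from the question and a simplified equivalent without capture groups.
var original = /(<([^>]+)>)/ig;
var simplified = /<[^>]+>/g;

// Hypothetical sample input (an assumption, not taken from the question).
var sample = '<div class="x"><p>Hello<br>world</p></div>';

var strippedOriginal = sample.replace(original, "");
var strippedSimplified = sample.replace(simplified, "");

// Both patterns remove exactly the same tags from this input.
console.log(strippedOriginal === strippedSimplified);
```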

Third, you could use a negative lookahead to exclude certain tags:

o.node.innerHTML=o.node.innerHTML.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig,"");

Explanation:

< matches the opening angle bracket.

(?!\s*\/?(br|p)\b) asserts that it's not possible to match zero or more whitespace characters, zero or one /, either br or p, followed directly by a word boundary. The word boundary is important; otherwise the lookahead would also trigger on tags like <pre> or <param ...>.

[^>]+ matches one or more characters that are not closing angle brackets.

> matches the closing angle bracket.

Note that you might run into trouble if a closing angle bracket occurs somewhere inside a tag, for example inside an attribute value.
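A small sketch of that failure mode, using a made-up input with a > inside an attribute value: the match ends at the first >, so the rest of the tag leaks into the output as text.

```javascript
// The lookahead-based pattern from this answer.
var stripExceptPAndBr = /<(?!\s*\/?(br|p)\b)[^>]+>/ig;

// Hypothetical input: the title attribute contains a closing angle bracket.
var tricky = '<a title="a > b" href="#">link</a>';

// [^>]+ stops at the ">" inside the attribute, so only '<a title="a >'
// and '</a>' are removed; the remainder of the opening tag survives.
var leaked = tricky.replace(stripExceptPAndBr, "");

console.log(JSON.stringify(leaked));
```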

So this will match (and strip)

<pre> <a href="dot.com"> </a> </pre>

and leave

<p> < p > < /br > <br /> <br> etc.

alone.
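As a rough sketch of how the same lookahead could be built for an arbitrary list of allowed tags: the helper name stripTagsExcept and the sample input are hypothetical, and the code assumes the tag names are plain alphanumeric strings that need no regex escaping. It inherits every limitation of the regex approach noted above.

```javascript
// Hypothetical helper: remove every tag except those named in `allowed`.
// Assumes each name is plain alphanumeric text (no regex metacharacters).
function stripTagsExcept(html, allowed) {
    var pattern = new RegExp(
        "<(?!\\s*\\/?(?:" + allowed.join("|") + ")\\b)[^>]+>",
        "ig"
    );
    return html.replace(pattern, "");
}

// Made-up input combining the kinds of tags listed above.
var input = '<pre><a href="dot.com">link</a></pre><p>One<br />two</p>';

var keepPAndBr = stripTagsExcept(input, ["p", "br"]);
var keepAnchors = stripTagsExcept(input, ["a"]);

console.log(keepPAndBr);
console.log(keepAnchors);
```

Passing ["p", "br"] reproduces the pattern shown earlier, apart from using a non-capturing group.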
