
Regexp to remove all HTML tags except <br>

I'm trying to write a regexp in JavaScript to remove ALL the HTML tags from an input string, except <br>.

I use /(<([^>]+)>)/ig for the tags and have tried a few things like adding [^(br)] to it, but I'm just getting confused now.

Could anyone help? I'm sure it's going to be a speed contest between SO gurus, so if an answer explains the logic of the expression, I'll choose it over the others.

Edit:

To all the 'don't do it' people, let me quote the following from Stack Overflow:

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

In this particular case, it's a bunch of text in a div that stays consistent across many pages. I just want to get rid of a few cases (1% at most) where the users have included spans, strongs and a few other formatting tags. It is not worth more than the time it takes to regexp it out, as it barely happens across the thousands of pages I process. If you have a better, faster-to-implement idea, feel free to post it as an answer ;)

Edit 2:

So many comments, I feel like adding a disclaimer: using regexp to parse HTML is bad. It will not work consistently and there are much better ways. DOMParser has been mentioned; there's Cheerio or jsdom on Node.js, and a lot more libraries that will parse an HTML document correctly (in 99% of cases). In this case, it is more like a string that happens to contain a few <...> that I needed to remove.

Try this:

/(<((?!br)[^>]+)>)/ig
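To make the logic of that expression concrete: `(?!br)` is a negative lookahead that refuses to match when the characters right after `<` are `br`, so those tags are left alone and everything else is removed. The snippet below is only a sketch of how it behaves; the helper name `stripExceptBr` and the sample strings are made up for illustration.

```javascript
// Hypothetical helper wrapping the pattern from the answer above.
// (?!br) fails when the text right after "<" starts with "br" (case-insensitive
// because of the i flag), so those tags are skipped and survive the replace.
const stripExceptBr = (html) => html.replace(/(<((?!br)[^>]+)>)/ig, '');

const cleaned = stripExceptBr('<div>one<br>two <span>three</span><br />four</div>');
console.log(cleaned); // one<br>two three<br />four

// Two edge cases worth knowing about:
// - "</br>" starts with "/", not "br", so it is removed;
// - "<broken>" starts with "br", so it is kept.
const edge = stripExceptBr('a</br>b<broken>c');
console.log(edge); // ab<broken>c
```

If the input can contain other tags whose names begin with "br", a stricter lookahead is probably needed.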

Use a DOMParser to parse your string, then traverse it (I used the code in this question), extracting the parts that you are interested in:

    var str = "<div>some text <span>some more</span><br /><a href='#'>a link</a>";
    var parser = new DOMParser();
    var dom = parser.parseFromString(str, "text/html");
    var text = "";

    // Depth-first walk: call func on the node, then on each of its children.
    var walkDOM = function (node, func) {
        func(node);
        node = node.firstChild;
        while (node) {
            walkDOM(node, func);
            node = node.nextSibling;
        }
    };

    walkDOM(dom, function (node) {
        if (node.tagName === 'BR') {
            // Keep <br> elements as markup.
            text += node.outerHTML;
        } else if (node.nodeType === 3) {
            // Text node: keep its content.
            text += node.nodeValue;
        }
    });

    alert(text);

This might work. But no matter the regex, it will fail to parse HTML.

 # /(?!<\/?br\s*\/?>)<[^>]+>/g

 (?! < /? br \s* /? > )
 < [^>]+ >
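As a rough illustration of what this second pattern accepts: the lookahead checks the whole tag, so only bare `<br>`, `<br/>`, `<br />` and `</br>` are spared. The sketch below uses invented sample strings and a hypothetical helper name, `stripTagsKeepBr`. Note that the pattern as written has no `i` flag, and a `<br>` carrying attributes does not satisfy the lookahead.

```javascript
// Hypothetical helper around the pattern from the answer above.
// The negative lookahead rejects a match only when the full tag is a bare br.
const stripTagsKeepBr = (html) => html.replace(/(?!<\/?br\s*\/?>)<[^>]+>/g, '');

const kept = stripTagsKeepBr('<p>a<br>b<br/>c<br />d</br>e<broken>f</p>');
console.log(kept); // a<br>b<br/>c<br />d</br>ef

// Without the i flag, an uppercase <BR> is treated like any other tag.
const upper = stripTagsKeepBr('x<BR>y');
console.log(upper); // xy

// A br with attributes is also stripped, because the lookahead expects ">"
// right after optional whitespace and an optional slash.
const withAttr = stripTagsKeepBr('a<br class="x">b');
console.log(withAttr); // ab
```

Adding the `i` flag would keep uppercase `<BR>` as well, if that matters for the input at hand.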

I ended up using:

.replace('<br>','%br%').replace(/(<([^>]+)>)/g,'')

Then I split on '%br%' instead of the regular br tag. It is not an HTML parser, and I am sure it will fail to parse 100% of the World Wide Web, but it solves my particular problem 100% of the time (just tried and tested).
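One thing to watch with the placeholder trick: when the first argument of `String.prototype.replace` is a string, only the first occurrence is replaced, so a second `<br>` would be stripped along with the other tags. The sketch below shows that behaviour and a possible variant with a global regex. The function name `splitOnBr` and the sample strings are hypothetical, and it assumes the text never contains the literal `%br%`.

```javascript
// Assumption: "%br%" never occurs in the real text, so it is safe as a marker.
const PLACEHOLDER = '%br%';

// With a string pattern, only the first <br> becomes a placeholder;
// the second one is then removed by the tag-stripping regex.
const firstOnly = 'a<br>b<br>c'.replace('<br>', PLACEHOLDER).replace(/(<([^>]+)>)/g, '');
console.log(firstOnly); // a%br%bc

// Hypothetical variant: mark every <br>, <br/> or <br /> (any case),
// strip the remaining tags, then split on the marker.
function splitOnBr(html) {
  return html
    .replace(/<br\s*\/?>/gi, PLACEHOLDER)
    .replace(/<[^>]+>/g, '')
    .split(PLACEHOLDER);
}

const parts = splitOnBr('first<br>second <strong>bold</strong><br />third');
console.log(parts); // [ 'first', 'second bold', 'third' ]
```

If each input only ever holds a single `<br>`, the original one-liner behaves the same as this variant.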
