Java regex: replace all HTML tags except br

Question

I need a regular expression that can be used with replaceAll to replace all HTML tags with an empty string, except any variation of br, so that line breaks are preserved.

I found the following to replace all HTML tags: <\\s*br\\s*\\[^>]

Answer 1

You might get some answers that claim to work.

Those answers might even work for the particular cases you try them against.

But know that regular expressions (which I'm fond of in general) are the wrong tool for the job in this case.

And as your project evolves and needs to cover more complex HTML inputs, the regular expression will get more and more convoluted, and there may well come a time when it simply cannot solve your problem anymore, period.

Do it the right way from the beginning. Use an HTML parser, not a regex.
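To make the parser route a bit more concrete, here is a minimal sketch. The answer doesn't name a particular parser, so this is my own choice: it uses the old HTML parser that ships with the JDK (javax.swing.text.html), only because it needs no extra dependency. The class and method names (ParserTagStripper, stripTagsExceptBr) are made up for illustration, and a dedicated HTML library would likely be a better fit for real-world markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps text and <br>, drops every other tag.
final class ParserTagStripper {

    static String stripTagsExceptBr(String html) {
        StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Text is delivered by the parser, not cut out with a pattern.
                // It may already be entity-decoded, so re-escape it if the
                // result is going back into an HTML page.
                out.append(data);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // br is an empty element, so it is reported as a "simple" tag.
                if (HTML.Tag.BR.equals(tag)) {
                    out.append("<br>");
                }
            }
            // Start tags, end tags and comments are ignored (default no-op methods).
        };

        try {
            // true = ignore any charset declaration inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTagsExceptBr("<p>Hello <b>world</b><br>again</p>"));
    }
}
```

Every br variant is normalized to a plain <br> here, which seemed acceptable for "maintain the line breaks"; whitespace handling is left to the parser and may differ from the input.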

For reference, here are some related SO posts:

Regex to match all HTML tags except <p> and </p>
Regex to replace all \\n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:

Answer 2

If the HTML is known to be valid, then you can use this regex (case-insensitive):

<(?!br\b)/?[a-z]([^"'>]|"[^"]*"|'[^']*')*>

but it can fail in interesting ways if you give it invalid HTML. Also, I took "HTML tags" pretty literally; the above won't cover <!-- comments --> and <!DOCTYPE declarations>, and won't convert <![CDATA[ blocks ]]> and &entity;s to plain text.
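For anyone who wants to try the pattern from Java, here is a small sketch of how it might be wired up. The regex is the one given above, with the quotes and backslash escaped for a Java string literal and the case-insensitive flag set; the class and method names (RegexTagStripper, stripTagsExceptBr) are just placeholders.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
final class RegexTagStripper {

    // Compiled once; CASE_INSENSITIVE so that <BR>, <Br/> etc. are also kept.
    private static final Pattern NON_BR_TAG = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    static String stripTagsExceptBr(String html) {
        // Equivalent to html.replaceAll(regex, "") but without recompiling.
        return NON_BR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The quoted attribute contains '>' and is still consumed as one tag.
        System.out.println(stripTagsExceptBr("<p class=\"a>b\">Hi<BR/>there</p>"));
        // prints: Hi<BR/>there
    }
}
```

One thing to watch: the lookahead is tested before the optional slash, so a stray </br> is treated like any other closing tag and removed.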

It's probably better to take a step back, think about why you want to strip out these HTML tags — that is, what you're actually trying to achieve — and then find an HTML-handling library that offers a better way to achieve that goal. HTML cleaning is really a solved problem; you shouldn't need to reinvent it.

UPDATE: I've just realized that, even for valid HTML, the above has some major limitations. For example, it will mishandle something like <!-- <foo> --> (converting it to just <!--  -->), and also something like <script><foo></script> (since HTML proper has a small number of tags with CDATA content, that is, everything after the start-tag until the first </ is taken to be character data, not containing HTML tags; fortunately, XHTML was forced to get rid of this concept due to XML's lack of support for it).

Both of these limitations can be addressed, of course (using more regexes!), but they should help reinforce the point that you should use a well-tested HTML-handling library rather than trying to roll your own regexes. If you have a lot of guarantees about the nature of the HTML you're trying to handle, then regexes can be useful; but if what you're trying to do is strip out arbitrary tags, then that's a good sign that you don't have these sorts of guarantees.
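A quick way to see both limitations is to run the same pattern over a comment and a script element. This is only a sketch: the two inputs are illustrative ones I picked, and RegexLimitationsDemo and strip are made-up names.

```java
import java.util.regex.Pattern;

// Hypothetical demo of where the regex above goes wrong on valid HTML.
final class RegexLimitationsDemo {

    private static final Pattern NON_BR_TAG = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return NON_BR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Inside a comment, <foo> is not markup, but the regex removes it
        // anyway and leaves the comment delimiters behind.
        System.out.println("[" + strip("<!-- <foo> -->") + "]");

        // Inside <script>, <foo> is character data, yet it is stripped
        // together with the surrounding script tags.
        System.out.println("[" + strip("<script><foo></script>") + "]");
    }
}
```

Neither result reflects what a real HTML parser would consider text versus markup, which is the point of the update.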

2 answers

solution1: 4, accepted, 2011-11-18 17:58:00
solution2: 1, 2011-11-18 17:56:58
