
Java regex: replace all HTML tags except br

I need a regular expression that can be used with replaceAll to replace all HTML tags with an empty string, except for any variation of br, so that line breaks are preserved.

I found the following to replace all HTML tags: <\\s*br\\s*\\[^>]

You might get some answers that claim to work.

Those answers might even work for the particular cases you try them against.

But know that regular expressions (which I'm fond of in general) are the wrong tool for the job in this case.

And as your project evolves and needs to cover more complex HTML inputs, the regular expression will get more and more convoluted, and there may well come a time when it simply cannot solve your problem anymore, period.

Do it the right way from the beginning. Use an HTML parser, not a regex.


If the HTML is known to be valid, then you can use this regex (case-insensitive):

<(?!br\b)/?[a-z]([^"'>]|"[^"]*"|'[^']*')*>

but it can fail in interesting ways if you give it invalid HTML. Also, I took "HTML tags" pretty literally; the above won't cover <!-- HTML comments --> and <!DOCTYPE declarations>, and won't convert <![CDATA[ blocks ]]> and &entity;s to plain text.
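As a minimal sketch of how the regex above could be plugged into Java, assuming the input is valid HTML as the answer requires: the class and method names below (BrPreservingStripper, stripTagsExceptBr) are made up for illustration, and the pattern is compiled with Pattern.CASE_INSENSITIVE to honor the "case-insensitive" condition.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative, not from the answer.
final class BrPreservingStripper {
    // The regex from the answer, escaped for a Java string literal.
    // (?!br\b) rejects tags whose name is exactly "br"; the \b keeps
    // names that merely start with "br" (such as a made-up <bravo>) strippable.
    private static final Pattern NON_BR_TAG = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    // Removes every tag matched by the pattern and leaves the rest untouched.
    static String stripTagsExceptBr(String html) {
        return NON_BR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p class=\"a>b\">Hello<BR/>world</p><br >done";
        // prints: Hello<BR/>world<br >done
        System.out.println(stripTagsExceptBr(html));
    }
}
```

Note that the lookahead sits before the optional slash, so a stray </br> is treated like any other closing tag and removed; if that variant must also survive, the pattern would need adjusting.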

It's probably better to take a step back, think about why you want to strip out these HTML tags — that is, what you're actually trying to achieve — and then find an HTML-handling library that offers a better way to achieve that goal. HTML cleaning is really a solved problem; you shouldn't need to reinvent it.

UPDATE: I've just realized that, even for valid HTML, the above has some major limitations. For example, it will mishandle something like <!--<yes--> (converting it to just <!--), and also something like <script><foo></script>. HTML proper has a small number of tags with CDATA content: everything after the start-tag until the first </ is taken to be character data, not containing HTML tags. (Fortunately, XHTML was forced to get rid of this concept due to XML's lack of support for it.) Both of these limitations can be addressed, of course, using more regexes! But they should help reinforce the point that you should use a well-tested HTML-handling library rather than trying to roll your own regexes. If you have a lot of guarantees about the nature of the HTML you're trying to handle, then regexes can be useful; but if what you're trying to do is strip out arbitrary tags, then that's a good sign that you don't have these sorts of guarantees.
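The two failure cases described in the update can be reproduced with a short sketch. It assumes the same regex as earlier in this answer, compiled case-insensitively; the class name StripperLimitations and the method strip are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class StripperLimitations {
    // Same regex as in the answer, escaped for a Java string literal.
    private static final Pattern NON_BR_TAG = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return NON_BR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // "<!--" is not followed by a letter, so it is skipped; the inner
        // "<yes-->" then matches as a tag and swallows the comment terminator.
        System.out.println("[" + strip("<!--<yes-->") + "]");            // prints: [<!--]

        // The regex has no notion of CDATA content, so "<foo>" inside the
        // script element is removed as if it were a tag.
        System.out.println("[" + strip("<script><foo></script>") + "]"); // prints: []
    }
}
```

In the first case the output is a dangling comment opener that will hide whatever follows it; in the second, text that was script content is silently lost.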
