regex to escape non-html tags' angle brackets

Question

I have HTML-based text (with HTML tags). I want to find words that occur within angle brackets but are not HTML tags and replace their brackets with &lt; and &gt;. The same should happen when angle brackets are used as math symbols.

eg:

String text = "Hello, <b> Whatever <br /> <table> <tr> <td width=\"300px\"> "
            + "1 < 2 This is a <test> </td> </tr> </table>";

I want this to become:

Hello,  <b> Whatever <br /> <table>  <tr> <td width="300px"> 
1 &lt; 2 This is a &lt; test &gt; </td> </tr> </table>

THANKS in advance

Answer 1

I would suggest using HtmlCleaner.

If you look at its home page, the example there shows exactly how text is escaped.

<td><a href=index.html>1 -> Home Page</a>

is converted into

<td>
   <a href="index.html">1 -&gt; Home Page</a>
</td>

It will normalize your HTML to conform to standard XHTML. I have used it in the past and (IMHO) it's pretty solid and more reliable than JTidy and co. (and of course it's better than using regex or replace strategies...).

Answer 2

Please see "RegEx match open tags except XHTML self-contained tags" and don't use regex to parse HTML. Use an SGML parser, but don't use regex. It would fail too often. HTML isn't a regular language.

Answer 3

If it were not for CSS, JavaScript, and CDATA sections, it would be possible.

If you are only dealing with a subset of HTML, you could make the assumption that angle brackets not surrounded by valid element identifier characters can be encoded.

Something like "<(?=[^A-Za-z_:0-9/])" -> "&lt;" and "(?<=[^A-Za-z_:0-9/])>" -> "&gt;"
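The two replacements above can be sketched in Java, the language used in the question. This is only an illustration of the suggested patterns, not a complete solution; the class name AngleBracketEscaper and the method escapeStray are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the two patterns from this answer.
public class AngleBracketEscaper {
    // '<' followed by a character that cannot start a tag name or be a closing slash
    private static final Pattern STRAY_LT = Pattern.compile("<(?=[^A-Za-z_:0-9/])");
    // '>' preceded by a character that cannot end a tag name or be a self-closing slash
    private static final Pattern STRAY_GT = Pattern.compile("(?<=[^A-Za-z_:0-9/])>");

    static String escapeStray(String html) {
        // Escape stray '<' first; the inserted "&lt;" contains no '>' so the
        // second pass is not affected by it.
        String s = STRAY_LT.matcher(html).replaceAll("&lt;");
        return STRAY_GT.matcher(s).replaceAll("&gt;");
    }

    public static void main(String[] args) {
        // prints: <b>1 &lt; 2 and 3 &gt; 2</b>
        System.out.println(escapeStray("<b>1 < 2 and 3 > 2</b>"));
    }
}
```

Two limitations follow directly from the character classes. A tag that ends in a quoted attribute, such as <td width="300px">, has its closing bracket escaped because the quote is not an identifier character. A tag-shaped word such as 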

But, unless you are generating the HTML yourself and KNOW that it has no embedded CSS, JavaScript, CDATA, or object sections...

As fraido said, don't use regular expressions for non-regular languages.

Answer 4

As everyone says, you shouldn't rely on regular expressions to parse HTML. They simply can't do it. But, in my case, I wanted to capture any angle brackets that didn't look like they were in an HTML tag, and escape them. Since everything was going through a sanitizer afterwards, security wasn't a concern, and the results just needed to be good enough to catch most situations, not all.

You need a regex library that supports zero-width lookahead assertions. In my case, that was Oniguruma in Ruby 1.8.

To match the less than symbols (<), I did:

<(?!(/?[A-Za-z_:0-9]+\s?/?>))

Matching the greater than (>) symbols is harder. Most libraries don't support zero-width lookbehind assertions of a variable length. So you cheat: reverse the string, run a lookahead assertion, and reverse it back afterwards, using the following pattern:

>(?!(/?\s?[A-Za-z_:0-9]+/?<))

So, my code looks a bit like:

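require 'oniguruma'

# Escape every '<' that does not open something shaped like a simple tag.
match_less_than = Oniguruma::ORegexp.new('<(?!(/?[A-Za-z_:0-9]+\s?/?>))')
string = match_less_than.gsub(string, '&lt;')

# Emulate a variable-length lookbehind: reverse the string, use a lookahead
# on the reversed text, then reverse the result back.
match_greater_than = Oniguruma::ORegexp.new('>(?!(/?\s?[A-Za-z_:0-9]+/?<))')
string = match_greater_than.gsub(string.reverse, '&gt;'.reverse).reverse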

Nasty, huh?
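For readers working in Java, as in the question, the same reverse-and-lookahead trick might look roughly like the sketch below. It reuses the two patterns from this answer unchanged; the class name ReverseEscaper and its method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical Java port of the reverse-and-lookahead approach described above.
public class ReverseEscaper {
    // '<' that is NOT followed by something shaped like "tag>", "/tag>" or "tag />"
    private static final Pattern LT =
            Pattern.compile("<(?!(/?[A-Za-z_:0-9]+\\s?/?>))");
    // Applied to the REVERSED text: '>' that is NOT followed by a reversed simple tag
    private static final Pattern GT_REVERSED =
            Pattern.compile(">(?!(/?\\s?[A-Za-z_:0-9]+/?<))");

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    static String escape(String html) {
        String s = LT.matcher(html).replaceAll("&lt;");
        // Replace in the reversed string with the reversed entity, then flip back.
        String r = GT_REVERSED.matcher(reverse(s)).replaceAll(reverse("&gt;"));
        return reverse(r);
    }

    public static void main(String[] args) {
        // prints: <b>1 &lt; 2 &gt; 0 <br /></b>
        System.out.println(escape("<b>1 < 2 > 0 <br /></b>"));
    }
}
```

Note that these patterns only recognise tags without attributes. A tag such as <td width="300px"> has both of its brackets escaped, which is consistent with the "good enough for most situations" goal stated above but does not reproduce the output asked for in the question.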
