Regular expression in javascript to match outside of XML tags

Question

I want to find all matches of "a" in <span class="get">habbitant morbi</span> triastbbitique, except for any "a" inside the tags (the matches I want are marked with * below).

<span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique.

Once I find them, I want to replace them while keeping the original tags intact.

This expression doesn't work:

var variable = "a";
var reg = new RegExp("[^<]."+variable+".[^>]$",'gi');

Answer 1

I would recommend not using a regular expression to parse HTML; it's not a regular grammar, and you will experience pain for all but simple cases.

Your question is still a bit unclear, but let me try rephrasing to see if I have it right:

You'd like to get all matches of a given string in an HTML document, except for matches inside the <tag> markup itself?

Assuming you're using jQuery or similar:

// Let the browser parse it for you:
var container = document.createElement('div'); // createElement requires a tag name
container.innerHTML = '<span class="get">habbitant morbi</span> triastbbitique';
var doc_text = $(container).text(); // text content only, no markup

// And then you can just regex away normally:
doc_text.match(/a/gi)

(Even better would be to use DOMParser, but that doesn't have wide browser support yet.)

If you're in Node, look for a library that parses HTML into nodes for you (like jsdom), and then just collect all the text nodes.
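
Since the question also asks to replace the matches while keeping the tags, here is a minimal sketch of the same DOM idea applied to replacement: walk the tree and rewrite only the text nodes. The helper name replaceInTextNodes is made up for illustration (it is not a DOM or jQuery function), and it assumes only that nodes expose nodeType, nodeValue and childNodes.

```javascript
// Hypothetical helper: apply a replacement to text nodes only,
// so tag names and attribute values are never touched.
function replaceInTextNodes(node, pattern, replacement) {
    var TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM
    if (node.nodeType === TEXT_NODE) {
        node.nodeValue = node.nodeValue.replace(pattern, replacement);
        return;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        replaceInTextNodes(children[i], pattern, replacement);
    }
}

// Browser usage (assumes a DOM is available):
// var container = document.createElement('div');
// container.innerHTML = '<span class="get">habbitant morbi</span> triastbbitique';
// replaceInTextNodes(container, /a/gi, '*$&*');
// container.innerHTML now holds the marked text with the original tags intact.
```

In the replacement string, $& stands for the matched text, so the original casing of each match is preserved.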

Answer 2

Note that this question isn't about parsing. This is lexing, something that regexes are regularly and properly used for.

If you want to go with regex there are a couple of ways you could do this.

A simple lookahead hack like:
```
 a(?![^<>]*>)
```
Note that this won't properly handle < and > that are quoted inside tags or left unescaped outside of tags.
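
As a rough illustration of the lookahead hack, the sketch below applies it to the string from the question and then to a string with an unescaped > in the text, where it silently skips a match. The variable names are only for the example.

```javascript
// Match "a" only if the next angle bracket ahead is not a closing ">",
// i.e. only if we do not appear to be inside a tag.
var outsideTags = /a(?![^<>]*>)/gi;

var html = '<span class="get">habbitant morbi</span> triastbbitique';
var marked = html.replace(outsideTags, '*$&*');
console.log(marked);
// <span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique

// Limitation: an unescaped ">" in the text makes the "a" before it
// look as if it were inside a tag, so it is not replaced.
var tricky = '<b>2a > 1</b>';
var trickyMarked = tricky.replace(outsideTags, '*$&*');
console.log(trickyMarked);
// <b>2a > 1</b>
```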
A full-blown tokenizer of the form:
```
 (expression for tag|comments|etc)|(stuff outside that that I'm interested in)
```
This is used with a replacement function that does different things depending on which part matched: if $1 matched, it is replaced by itself; if $2 matched, it is replaced with *$2*.

The full tokenizer way is of course not a trivial task; the spec isn't small.

But if you simplify it to match only the basic tags and ignore CDATA, comments, script/style tags, etc., you could use the following:

var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';

var re = /(<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>)|(a)/gi;

var res = str.replace(re, function(m, tag, a){
    return tag ? tag : "*" + a + "*";
});

Result:

<span class="a <lal> a" attr>h*a*bbit*a*nt 2 > morbi. 2*a* < 3*a*</span> tri*a*stbbitique

This handles messy tags, quotes and unescaped < / > in the HTML.
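
The question builds its pattern from a variable, so here is a sketch of the same tokenizer idea wrapped in a function that accepts an arbitrary search string. The names escapeRegExp and markOutsideTags are hypothetical. The sketch assumes a non-empty search string, and it drops the inner + from the tag pattern (which matches the same tags) to limit backtracking on unclosed tags.

```javascript
// Escape regex metacharacters so the search string is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap every occurrence of `needle` outside of tags in asterisks.
function markOutsideTags(html, needle) {
    // Basic tags only: quoted attribute values may contain < and >.
    var tagPattern = '<[a-z/](?:"[^"]*"|\'[^\']*\'|[^\'">])*>';
    var re = new RegExp('(' + tagPattern + ')|(' + escapeRegExp(needle) + ')', 'gi');
    return html.replace(re, function (match, tag, hit) {
        // Tags are copied through unchanged; hits get marked.
        return tag ? tag : '*' + hit + '*';
    });
}

var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';
console.log(markOutsideTags(str, 'a'));
// <span class="a <lal> a" attr>h*a*bbit*a*nt 2 > morbi. 2*a* < 3*a*</span> tri*a*stbbitique

console.log(markOutsideTags('<p title="a.b">a.b axb</p>', 'a.b'));
// <p title="a.b">*a.b* axb</p>
```

Escaping the search string matters as soon as it contains characters such as . or ?, which would otherwise be interpreted as regex syntax.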

Couple examples of tokenizing HTML tags with regex (which should translate fine to JS regex):

solution1
4 2013-03-09 22:38:08

solution2
2 ACCEPTED 2013-03-09 22:54:56