
Regular expression in javascript to match outside of XML tags

I want to find all matches of "a" in <span class="get">habbitant morbi</span> triastbbitique, except for any "a" inside the tags themselves (see below: the matches I want are marked between asterisks).

<span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique.

Once I find them, I want to replace them while keeping the original tags intact.

This expression doesn't work:

var variable = "a";
var reg = new RegExp("[^<]."+variable+".[^>]$",'gi');

I would recommend not using a regular expression to parse HTML; it's not a regular grammar, and you will experience pain for all but simple cases.

Your question is still a bit unclear, but let me try rephrasing to see if I have it right:

You'd like to get all matches of a given string in an HTML document, except for matches inside the <tag>s themselves?


Assuming you're using jQuery or similar:

// Let the browser parse it for you:
var container = document.createElement('div')  // createElement needs a tag name
container.innerHTML = '<span class="get">habbitant morbi</span> triastbbitique'
var doc_text  = $(container).text()

// And then you can just regex away normally:
doc_text.match(/a/gi)

(Even better would be to use DOMParser, though at the time of writing it didn't have wide browser support.)

If you're in Node, look for a library that helps you parse HTML (like jsdom), and then just collect all the text nodes.

Note that this question isn't about parsing. This is lexing, something regexes are regularly and properly used for.

If you want to go with regex there are a couple of ways you could do this.

  • A simple hack lookahead like:

     a(?![^<>]*>)

    Note that this won't properly handle < and > quoted inside tags or left unescaped outside of tags.

  • A full-blown tokenizer of the form:

     (expression for tags|comments|etc.)|(the stuff outside of those that I'm interested in)

    used with a replacement function that does different things depending on which part matched: if $1 matched, it is replaced by itself; if $2 matched, it is replaced with *$2*.
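To make the first option concrete, here is a minimal sketch of the lookahead hack, together with two inputs it gets wrong. The function name markWithLookahead is made up for this illustration and the inputs are only examples.

```javascript
// Hypothetical helper: wrap each "a" that is not followed by a ">" before the
// next "<" or ">", i.e. each "a" that does not appear to sit inside a tag.
function markWithLookahead(html) {
    return html.replace(/a(?![^<>]*>)/gi, "*$&*");
}

// Works for well-behaved markup such as the string from the question.
var ok = markWithLookahead('<span class="get">habbitant morbi</span> triastbbitique');

// Limitation 1: an unescaped ">" in the text makes the preceding "a" look
// like tag content, so it is skipped.
var missed = markWithLookahead('a > b');

// Limitation 2: a "<" quoted inside an attribute value hides the tag's
// closing ">", so the "a" inside the attribute gets wrapped.
var leaked = markWithLookahead('<i title="a<b">x</i>');

console.log(ok);
console.log(missed);
console.log(leaked);
```

If the input is known to be clean (no stray angle brackets in text or attribute values), this one-liner is probably enough; otherwise the tokenizer approach is the safer of the two.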

The full tokenizer way is of course not a trivial task; the spec isn't small.

But if you simplify to match only basic tags and ignore CDATA, comments, script/style tags, etc., you could use the following:

var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';

var re = /(<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>)|(a)/gi;

var res = str.replace(re, function(m, tag, a){
    return tag ? tag : "*" + a + "*";
});

Result:

<span class="a <lal> a" attr>h*a*bbit*a*nt 2 > morbi. 2*a* < 3*a*</span> tri*a*stbbitique

Live Example:

var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';

var re = /(<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>)|(a)/gi;

var res = str.replace(re, function(m, tag, a){
    return tag ? tag : "*" + a + "*";
});

console.log(res);

This handles messy tags, quotes and unescaped < / > in the HTML.
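The question builds its pattern from a variable, so here is a rough sketch of how the simplified tokenizer regex could be generalized to an arbitrary search string. The names escapeRegExp and wrapOutsideTags are hypothetical helpers introduced for this example; the tag pattern is the same simplified one used above, so the same caveats (no comments, CDATA, script/style handling) apply.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so the search term is always matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Simplified tag pattern from the answer above (basic tags only).
var TAG = /<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>/;

// Hypothetical helper: call wrap() on every occurrence of `term` outside of
// tags, and copy tags through unchanged.
function wrapOutsideTags(html, term, wrap) {
    if (term === "") {
        return html; // an empty term would match at every position
    }
    var re = new RegExp("(" + TAG.source + ")|(" + escapeRegExp(term) + ")", "gi");
    return html.replace(re, function (match, tag, hit) {
        return tag ? tag : wrap(hit);
    });
}

var marked = wrapOutsideTags(
    '<span class="get">habbitant morbi</span> triastbbitique',
    "a",
    function (hit) { return "<b>" + hit + "</b>"; }
);

// A term containing regex metacharacters is still matched literally.
var literal = wrapOutsideTags(
    '<i title="1+1">1+1=2</i>',
    "1+1",
    function (hit) { return "[" + hit + "]"; }
);

console.log(marked);
console.log(literal);
```

Passing the wrapper as a function rather than a replacement string keeps the caller free to build any markup around the hit without worrying about "$" sequences in the replacement.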


Couple examples of tokenizing HTML tags with regex (which should translate fine to JS regex):

