
Regex replace string but not inside html tag

I want to replace a string in an HTML page using JavaScript, but ignore it if it is inside an HTML tag. For example:

<a href="google.com">visit google search engine</a>
you can search on google tatatata...

I want to replace google with <b>google</b>, but not inside the link, like this:

<a href="google.com">visit google search engine</a>
you can search on <b>google</b> tatatata...

I tried with this one:

regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML =  el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');

The problem is that I also get <b>google</b> inside the <a> tag:

<a href="google.com">visit <b>google</b> search engine</a>
you can search on <b>google</b> tatatata...

How can I fix this?

You'd be better off using an HTML parser for this, rather than regex. I'm not sure it can be done 100% reliably.

You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except if it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).

It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.
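
As a rough sketch of the DOM-traversal idea (not code from any answer here): walk the child nodes, collect only the text nodes, and skip whole elements such as <a>. The names collectTextNodes and boldWord, the default skip list, and the case-sensitive matching are assumptions made for illustration, and boldWord needs a browser document to run.

```javascript
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE

// Collect the text nodes under `root`, skipping every element whose
// tag name appears in `skip` (hypothetical default list).
function collectTextNodes(root, skip = ['A', 'SCRIPT', 'STYLE']) {
  const found = [];
  for (const child of Array.from(root.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      found.push(child);
    } else if (
      child.nodeType === ELEMENT_NODE &&
      !skip.includes(child.nodeName.toUpperCase())
    ) {
      found.push(...collectTextNodes(child, skip));
    }
  }
  return found;
}

// Browser-only: wrap each case-sensitive occurrence of `word` in <b>.
function boldWord(root, word) {
  if (!word) return; // an empty word would match forever
  for (const node of collectTextNodes(root)) {
    let current = node;
    let index;
    while ((index = current.nodeValue.indexOf(word)) !== -1) {
      const match = current.splitText(index); // node starting at the word
      current = match.splitText(word.length); // remainder after the word
      const bold = document.createElement('b');
      match.parentNode.replaceChild(bold, match);
      bold.appendChild(match);
    }
  }
}
```

In a page this could be called as boldWord(el, 'google'), where el is the container element from the question. Because no HTML string is ever parsed with a regex, attribute values such as href are never touched.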

You can't really do that: your "google" is always inside some tag, so you either replace all of them or none.

Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.

For more details see this Stack Overflow question (and answers).

I think you're all missing the question here...

When he says "inside the tag", he means inside the opening tag itself, as in the <a href="google.com"> tag. This is quite different from text inside, say, a <p> </p> tag pair or <body> </body>. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, I'll come back and post.

WORKAROUND

If you can't use an HTML parser, or are quite confident about your HTML structure, try this:

  1. Do the "bad" replacement, the one that also changes text inside tags.
  2. Repeatedly replace (<[^>]*)(<[^>]+>) with $1, as many times as you need, to strip the tags that ended up inside other tags.

It's a simple workaround, but works for me.

Cons? Well... you have to run the replace twice for the case ... ...>, as each pass removes only the first unwanted tag from every tag on the page.
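
The two steps of this workaround can be sketched roughly as follows. The function name boldThenRepair and the loop that repeats until nothing changes are assumptions added for illustration. It also assumes well-formed HTML with no stray "<" in the text. Like the workaround itself, it only repairs tags that landed inside other tags, so the link text is still wrapped.

```javascript
// Hypothetical helper: naive replace first, then strip tags nested in tags.
function boldThenRepair(html, word) {
  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  // Step 1: the "bad" replacement, which also hits attribute values.
  let out = html.replace(new RegExp(escaped, 'gi'), '<b>$&</b>');

  // Step 2: drop one tag nested inside another tag per pass,
  // and repeat until a pass changes nothing.
  const nestedTag = /(<[^>]*)(<[^>]+>)/g;
  let previous;
  do {
    previous = out;
    out = out.replace(nestedTag, '$1');
  } while (out !== previous);

  return out;
}

const repaired = boldThenRepair(
  '<a href="google.com">visit google search engine</a>\n' +
    'you can search on google tatatata...',
  'google'
);
console.log(repaired);
```

Looping until the string stops changing avoids having to guess how many passes are needed.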

[edit:] SOLUTION

Why not use jQuery, put the html code into the page and do something like this:

$(containerOrSth).find('a').each(function () {
  var $link = $(this);
  if ($link.children().length === 0) {
    // The link contains only text, so it can be rewritten as plain text.
    $link.text($link.text().replace('google', 'evil'));
  } else {
    // The link has child tags: you have to handle them yourself, and you have
    // to know where to expect them (before or after the text). Comment for more help.
  }
});

I'm using regex = new RegExp("(?=[^>]*<)google", 'i');
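
To see what this pattern does, here is a small sketch that runs it with the g flag added, next to a commonly used negative-lookahead variant. The function names and the variant are assumptions, not part of the answer. Both protect only the text between a tag's angle brackets, so the link text is still replaced.

```javascript
const sample =
  '<a href="google.com">visit google search engine</a>\n' +
  'you can search on google tatatata...';

// The pattern from the answer, with "g" added so every match is replaced.
// It requires a "<" to appear before the next ">", so text that is not
// followed by any tag (the last line of `sample`) is left alone.
function boldWithLookahead(input) {
  return input.replace(/(?=[^>]*<)google/gi, '<b>$&</b>');
}

// Variant (an assumption): reject a match only when the next angle
// bracket after it is ">", which means the match sits inside a tag.
function boldWithNegativeLookahead(input) {
  return input.replace(/google(?![^<>]*>)/gi, '<b>$&</b>');
}

console.log(boldWithLookahead(sample));
console.log(boldWithNegativeLookahead(sample));
```

The variant also catches trailing text that has no tag after it, which matters when the string comes from el.innerHTML and ends in plain text.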

Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might just check for that part, mainly by making sure you don't have a trailing </a> tag before a fresh <a>.
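
One way to express that check, as a sketch only: add a second negative lookahead that rejects a match when a closing </a> comes up before any new <a. The function name boldOutsideLinks is hypothetical, and the pattern assumes links are never nested and that no "<a" or "</a>" text appears inside scripts or comments.

```javascript
// Hypothetical helper: bold `word` except inside tags and inside <a>...</a>.
function boldOutsideLinks(html, word) {
  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    escaped +
      '(?![^<>]*>)' + // not between a tag's angle brackets
      '(?!(?:(?!<a\\b)[\\s\\S])*</a>)', // no </a> ahead before a fresh <a
    'gi'
  );
  return html.replace(pattern, '<b>$&</b>');
}

const outsideLinks = boldOutsideLinks(
  '<a href="google.com">visit google search engine</a>\n' +
    'you can search on google tatatata...',
  'google'
);
console.log(outsideLinks);
```

The second lookahead rescans the rest of the string for every candidate match, so this is better suited to small fragments than to whole pages.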

You can do that using a regex, but filtering blocks like STYLE, SCRIPT and CDATA needs more work and is not implemented in the following solution.

Most of the answers state that 'your data is always in some tags', but they miss the point: the data is always 'between' some tags, and you want to filter out the places where it is 'in' a tag.

Note that tag characters in inline scripts will likely break this, so if they exist, they should be processed separately with this method. Take a look here: complex html string.replace function

I can give you a hacky solution. Pick a non-printable character that's not in your string. Duplicate your buffer, then overwrite the tags in the duplicate with that non-printable character. Run the regex on the duplicate to find the position and length of each match. Now you know where to perform the replace in the original buffer.
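
A minimal sketch of this masking idea, assuming the NUL character never occurs in the input and that tags contain no ">" inside attribute values. The function name boldViaMask is hypothetical.

```javascript
// Hypothetical helper: find matches on a masked copy, edit the original.
function boldViaMask(html, word) {
  const FILL = '\u0000'; // assumed to be absent from `html`

  // Same-length copy in which every tag is overwritten with FILL,
  // so match offsets in the copy line up with the original.
  const masked = html.replace(/<[^>]*>/g, (tag) => FILL.repeat(tag.length));

  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(escaped, 'gi');

  let out = '';
  let last = 0;
  for (const m of masked.matchAll(pattern)) {
    const end = m.index + m[0].length;
    // Copy the untouched part, then wrap the original text of the match.
    out += html.slice(last, m.index) + '<b>' + html.slice(m.index, end) + '</b>';
    last = end;
  }
  return out + html.slice(last);
}

const maskedResult = boldViaMask(
  '<a href="google.com">visit google search engine</a>\n' +
    'you can search on google tatatata...',
  'google'
);
console.log(maskedResult);
```

Keeping the mask the same length as the original is what makes the positions transferable. As with the lookahead patterns, only text between angle brackets is protected.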


 