
Regular expression to replace match keywords outside html tags AND anchor (a) tag text

I am developing an ASP.NET application and want to add a keyword linking system.

I want to turn the keyword into a hyperlink to another page. However, the keyword must not be linked if it is already linked (to any page). For example:

it is a <a href="http://www.somesite.com">linked keyword</a> and it should be a linked keyword.

should convert to:

it is a <a href="http://www.somesite.com">linked keyword</a> and it should be a linked <a href="http://newlycreatedLink.com">keyword</a>.

As you can see, the first keyword should be left intact.

Could you please help me solve this problem?

I found this link in the ASP.NET forums, but I need to adapt the answer there to exclude keywords that are already linked. I've searched everywhere but found nothing.

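To check whether the keyword is "outside" a tag, use a lookahead:

  • (?= asserts that the keyword is followed by an opening <tag or by the end of the input ($)
  • [^<>]* matches any number of characters that are neither < nor >
  • followed by (?:<\w|$), where \w is shorthand for word characters (roughly [a-zA-Z_0-9]; in .NET it also matches Unicode letters and digits unless RegexOptions.ECMAScript is set)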

So the pattern could look like:

String pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|$))";

String replacement = @"<a href=""http://newlycreatedLink.com"">$0</a>";

The keyword is wrapped in word boundaries (\b), and the (?i) modifier makes the match case-insensitive.

So this only replaces a keyword that is followed by an opening tag or by the end of the input.
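To see the pattern and replacement working together, here is a minimal sketch run against the example from the question. The class and method names (KeywordLinker, LinkKeyword) are made up for illustration. Two details are easy to get wrong: inside a verbatim string (@"...") a double quote is written as "", and .NET refers to the whole match as $0 in the replacement string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class KeywordLinker
{
    // Matches "keyword" only when the next tag boundary after it is an
    // opening tag (<\w) or the end of the input ($).
    private const string Pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|$))";

    // "" is an escaped quote in a verbatim string; $0 inserts the matched text.
    private const string Replacement = @"<a href=""http://newlycreatedLink.com"">$0</a>";

    public static string LinkKeyword(string html)
    {
        return Regex.Replace(html, Pattern, Replacement);
    }

    public static void Main()
    {
        string input = "it is a <a href=\"http://www.somesite.com\">linked keyword</a> and it should be a linked keyword.";
        // The keyword inside the existing anchor is followed by </a, so it is skipped.
        Console.WriteLine(LinkKeyword(input));
    }
}
```

Because $0 inserts the text exactly as it was matched, the original casing of the keyword is preserved.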


UPDATE: To also replace the keyword "inside" elements whose closing tag does not start with </a, add the alternative |<\/[^a] :

String pattern = @"(?i)\bkeyword\b(?=[^<>]*(?:<\w|<\/[^a]|$))";
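As a rough sketch of how the updated pattern could be used for arbitrary keywords, the helper below builds the pattern from a keyword and a target URL. The names (KeywordLinkerUpdated, LinkKeyword, keyword, url) are hypothetical. It assumes the keyword starts and ends with a word character, so that \b behaves as intended, and that the URL is already safe to place in an attribute.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class KeywordLinkerUpdated
{
    public static string LinkKeyword(string html, string keyword, string url)
    {
        // Regex.Escape keeps special characters in the keyword literal.
        // The lookahead accepts an opening tag, a closing tag whose name
        // does not start with "a", or the end of the input.
        string pattern = @"(?i)\b" + Regex.Escape(keyword)
                       + @"\b(?=[^<>]*(?:<\w|<\/[^a]|$))";

        // A MatchEvaluator avoids having to escape "$" in the URL.
        return Regex.Replace(html, pattern,
            m => "<a href=\"" + url + "\">" + m.Value + "</a>");
    }

    public static void Main()
    {
        string input = "<p>A <b>keyword</b> and <a href=\"x\">keyword</a></p>";
        // The keyword in <b> is linked; the one already inside <a> is left alone.
        Console.WriteLine(LinkKeyword(input, "keyword", "http://newlycreatedLink.com"));
    }
}
```

Note that [^a] also rules out closing tags whose names merely begin with "a", such as </abbr>. A keyword inside an anchor that is followed by another opening tag before </a> (for example <a href="x">keyword <b>here</b></a>) would still be matched, so treat this as an approximation.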

Don't use regular expressions for sophisticated HTML parsing like this. Use a proper HTML parser instead; here's why.


 