Java regex to match all html elements except one special case

I have a string with some markup which looks like this:

The quick brown <a href="www.fox.org">fox</a> jumped over the lazy <a href="entry://id=6000009">dog</a> <img src="dog.png" />.

I'm trying to strip away everything except the anchor elements with "entry://id=" inside. Thus the desired output from the above example would be:

The quick brown fox jumped over the lazy <a href="entry://id=6000009">dog</a>.

Writing this match, the closest I've come so far is:

<.*?>!<a href=\"entry://id=\\d+\">.*?<\\/a>

But I can't figure out why this doesn't work. Any help (apart from the "why don't you use a parser" :) would be greatly appreciated!

I would really not use regexps for parsing HTML. HTML isn't a regular language, and there is no end of edge cases to trip you up.

Check out JTidy instead.

Not easily possible with regex. I recommend a parser that understands the semantics of HTML/XML.

If you insist, you could take a multi-step approach, something like:

  • Replace "<(a\s*href="entry:.*?/a)>" with "{{{{\1}}}}" (the backreference is written $1 in Java's replacement syntax)
  • Replace "<(?!/a}}}})[^>]*>" with ""
  • Replace "{{{{" with "<"
  • Replace "}}}}" with ">"

Be warned that the above is error-prone and will fail at some point. Consider it an ugly hack, not a real solution. Something like the above is okay for a one-off edit of some text file in a regex-aware text editor, but for repeated, real-world use as part of data processing in an app - not so much.
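The four steps above can be sketched in Java roughly as follows. This is only an illustration under the same caveats: the class and method names are made up for the example, `$1` is used because that is how Java spells the backreference in a replacement, and the closing braces in the lookahead are escaped purely as a precaution.

```java
public class MultiStepStrip {

    // Hypothetical helper; each call mirrors one of the four replacement steps.
    static String stripExceptEntryLinks(String html) {
        return html
            // Step 1: re-wrap the anchors to keep in {{{{ ... }}}} instead of < ... >
            .replaceAll("<(a\\s*href=\"entry:.*?/a)>", "{{{{$1}}}}")
            // Step 2: remove every remaining tag, skipping the protected "</a}}}}"
            .replaceAll("<(?!/a\\}{4})[^>]*>", "")
            // Steps 3 and 4: turn the placeholders back into angle brackets
            .replace("{{{{", "<")
            .replace("}}}}", ">");
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                 + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(stripExceptEntryLinks(s));
    }
}
```

Note that the space in front of the removed img element stays in the output, and any text that already contains "{{{{" or "}}}}" would be corrupted by steps 3 and 4.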

Using this pattern:

((<a href="entry://id=\d+">.*?</a>)|<!\[CDATA\[.*?\]\]>|<!--.*?-->|<.*?>)

and combining it with a replace-all that uses $2 as the replacement would work for your example. The test below demonstrates it:

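import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestStack1305864 {

    @Test
    public void matcherWithCdataAndComments() {
        // Input: an ordinary tag, a non-entry link, a CDATA section containing '>',
        // the entry link to keep, and a self-closing tag.
        String s = "The quick <span>brown</span> <a href=\"www.fox.org\">fox</a> jumped over the lazy <![CDATA[ > ]]> <a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        // Expected: only the entry link survives. Removing the CDATA section keeps the
        // space on each side of it, hence the double space before the link.
        String r = "The quick brown fox jumped over the lazy  <a href=\"entry://id=6000009\">dog</a> .";
        // Group 2 captures the links to keep; the other alternatives match CDATA
        // sections, comments, and any other tag.
        String pattern = "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)";

        // "$2" re-inserts the kept link, or nothing when group 2 did not take part in the match.
        String t = s.replaceAll(pattern, "$2");
        System.out.println(t);
        System.out.println(r);
        assertEquals(r, t);
    }
}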

The idea is to capture every element you want to keep in a dedicated group (group 2 here) so that it can be inserted back into the string. A single replace-all then handles both cases:
For every element that is not one of the interesting ones, the group is empty and the element is replaced with "".
For the interesting elements, the group is not empty and its content is put back into the result string.
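To make the two cases visible, here is a small sketch that spells out what the replace-all does with an explicit Matcher loop. The class and method names are invented for the illustration. The behavior should be the same as replaceAll(pattern, "$2"), just with the "group is empty" branch written out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeepGroupDemo {

    // Same pattern as above: group 2 holds the anchors that should be kept.
    private static final Pattern TAGS = Pattern.compile(
        "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)");

    static String strip(String input) {
        Matcher m = TAGS.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String keep = m.group(2); // null when another alternative matched
            // Uninteresting element: replace with "". Interesting element: put it back.
            m.appendReplacement(out, keep == null ? "" : Matcher.quoteReplacement(keep));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a <b>x</b> <a href=\"entry://id=1\">y</a>"));
    }
}
```

Matcher.quoteReplacement is used so that a '$' or '\' inside a kept link is not interpreted as replacement syntax.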

edit: handle nested < or > in CDATA and comments
edit: see http://martinfowler.com/bliki/ComposedRegex.html for a regex composition pattern, designed to make regex more readable for maintenance purposes.
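As a rough illustration of that composition idea applied to the pattern above, the regex can be assembled from named pieces. The constant names below are made up for this sketch. The assembled string is meant to be identical to the one-line pattern.

```java
public class ComposedTagPattern {

    // Each alternative gets a name that says what it matches.
    static final String ENTRY_LINK = "<a href=\"entry://id=\\d+\">.*?</a>";
    static final String CDATA      = "<!\\[CDATA\\[.*?\\]\\]>";
    static final String COMMENT    = "<!--.*?-->";
    static final String ANY_TAG    = "<.*?>";

    // Group 2 wraps the entry links; order matters, so the generic tag comes last.
    static final String PATTERN =
        "((" + ENTRY_LINK + ")|" + CDATA + "|" + COMMENT + "|" + ANY_TAG + ")";

    static String strip(String s) {
        return s.replaceAll(PATTERN, "$2");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>keep <a href=\"entry://id=42\">this</a></p>"));
    }
}
```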

