Java regex to match all html elements except one special case

Question

I have a string with some markup which looks like this:

The quick brown <a href="www.fox.org">fox</a> jumped over the lazy <a href="entry://id=6000009">dog</a> <img src="dog.png" />.

I'm trying to strip away everything except the anchor elements with "entry://id=" inside. Thus the desired output from the above example would be:

The quick brown fox jumped over the lazy <a href="entry://id=6000009">dog</a>.

Writing this match, the closest I've come so far is:

<.*?>!<a href=\"entry://id=\\d+\">.*?<\\/a>

But I can't figure out why this doesn't work. Any help (apart from the "why don't you use a parser" :) would be greatly appreciated!

Answer 1

I would really not use regexps for parsing HTML. HTML isn't regular, and there is no end of edge cases to trip you up.

Check out JTidy instead.

Answer 2

Not easily possible with regex. I recommend a parser that understands the semantics of HTML/XML.

If you insist, you could use a multi-step approach, something like:

Replace "<(a\s*href="entry:.*?/a)>" with "{{{{\1}}}}"
Replace "<(?!/a}}}})[^>]*>" with ""
Replace "{{{{" with "<"
Replace "}}}}" with ">"
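
In Java, the four steps above might look like the following minimal sketch. The class and method names are made up for illustration. Two assumptions about the translation: Java writes the back-reference in a replacement string as $1 rather than \1, and the closing braces in the lookahead are escaped to be on the safe side.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answer.
final class PlaceholderStrip {

    // Step 1: matches the anchors to keep, so they can be wrapped in {{{{ ... }}}}.
    private static final Pattern KEEP =
            Pattern.compile("<(a\\s*href=\"entry:.*?/a)>");

    // Step 2: any remaining tag, except the protected "</a}}}}" closer.
    private static final Pattern OTHER_TAG =
            Pattern.compile("<(?!/a\\}\\}\\}\\})[^>]*>");

    static String strip(String html) {
        String marked = KEEP.matcher(html).replaceAll("{{{{$1}}}}");
        String stripped = OTHER_TAG.matcher(marked).replaceAll("");
        // Steps 3 and 4: String.replace is literal, so no regex escaping is needed.
        return stripped.replace("{{{{", "<").replace("}}}}", ">");
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(strip(s));
    }
}
```

On the example from the question, the space that preceded the removed img element stays in the output, so the result ends in "</a> ." rather than "</a>.".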

Be warned that the above is error-prone and will fail at some point. Consider it an ugly hack, not a real solution. Something like the above is okay for a one-off edit of some text file in a regex-aware text editor, but for repeated, real-world use as part of data processing in an app - not so much.

Answer 3

Use this pattern:

((<a href="entry://id=\d+">.*?</a>)|<!\[CDATA\[.*?\]\]>|<!--.*?-->|<.*?>)

Combining it with a replaceAll that uses $2 as the replacement works for your example. The test below demonstrates it:

import java.util.regex.Pattern;

import static org.junit.Assert.assertEquals;
import org.junit.Test;


public class TestStack1305864 {

    @Test
    public void matcherWithCdataAndComments() {
        String s = "The quick <span>brown</span> <a href=\"www.fox.org\">fox</a> jumped over the lazy <![CDATA[ > ]]> <a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        // Whitespace around removed elements is kept, hence the two spaces
        // after "lazy" and the space before the final period.
        String r = "The quick brown fox jumped over the lazy  <a href=\"entry://id=6000009\">dog</a> .";
        // Group 2 captures the anchors to keep; the remaining alternatives
        // match CDATA sections, comments, and any other tag.
        Pattern p = Pattern.compile("((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)");

        // Matches in which group 2 did not participate are replaced with "".
        String t = p.matcher(s).replaceAll("$2");
        System.out.println(t);
        System.out.println(r);
        assertEquals(r, t);
    }
}

The idea is to capture every element you want to keep in a dedicated group (group 2 here) so that it can be inserted back into the string.
The replace-all then behaves as follows:
For every element that is not one of the interesting ones, the group is empty and the element is replaced with "".
For the interesting elements, the group is not empty and its contents are appended to the result string.
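
To make those two cases visible, here is a sketch of the same replacement written as an explicit find loop instead of replaceAll. The class and method names are hypothetical. In Java, a group that did not take part in the match is reported as null, which is what the loop checks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
final class KeepGroupDemo {

    private static final Pattern TAGS = Pattern.compile(
            "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)");

    static String strip(String html) {
        Matcher m = TAGS.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // group(2) is null whenever one of the other alternatives matched.
            String keep = m.group(2);
            // quoteReplacement guards against '$' or '\' inside the kept anchor.
            m.appendReplacement(out, keep == null ? "" : Matcher.quoteReplacement(keep));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a <b>x</b> <!-- <i> --> <a href=\"entry://id=1\">y</a>"));
    }
}
```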

edit: handle nested < or > in CDATA and comments
edit: see http://martinfowler.com/bliki/ComposedRegex.html for a regex composition pattern, designed to make regex more readable for maintenance purposes.
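
As a rough illustration of that composition idea applied to the pattern above, the alternatives can be given names and joined together. This is only a sketch: the class and constant names are invented here, and it assumes Java 7 or later for the named group, which takes over the role of group 2.

```java
import java.util.regex.Pattern;

// Hypothetical class; the constant names are illustrative.
final class ComposedTagPattern {

    static final String KEPT_ANCHOR = "<a href=\"entry://id=\\d+\">.*?</a>";
    static final String CDATA = "<!\\[CDATA\\[.*?\\]\\]>";
    static final String COMMENT = "<!--.*?-->";
    static final String ANY_TAG = "<.*?>";

    // The named group "keep" replaces the numbered group 2 of the original pattern.
    static final Pattern TAGS = Pattern.compile(
            "(?<keep>" + KEPT_ANCHOR + ")|" + CDATA + "|" + COMMENT + "|" + ANY_TAG);

    static String strip(String html) {
        // "${keep}" expands to nothing when the group did not participate.
        return TAGS.matcher(html).replaceAll("${keep}");
    }

    public static void main(String[] args) {
        System.out.println(strip("x <!-- <b> --> <a href=\"entry://id=7\">y</a> <br />"));
    }
}
```

The order of the alternatives still matters: the anchor to keep has to come before the catch-all tag pattern, otherwise the catch-all would match its opening tag first.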

3 answers

solution1
7 2009-08-20 12:36:09

solution2
1 2009-08-20 12:44:50

solution3
1 ACCEPTED 2009-08-20 13:37:32
