Extract text from html source using regular expressions java

Question

I would like to extract text from an HTML page using regular expressions. Here is my code:

String regExp = "<h3 class=\"field-content\"><a[^>]*>(\\w+)</a></h3>";
Pattern regExpMatcher = Pattern.compile(regExp, Pattern.UNICODE_CHARACTER_CLASS);

String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3><h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
Matcher m = regExpMatcher.matcher(example);
while (m.find()) {
    System.out.println(m.group(1));
}

I would like to get the values Проба 1 and Проба 2. However, I only get the first value, Проба 1. What is the problem?

Answer 1

It is blasphemy to use regex + HTML. But if you really want to be cursed then here it is (you have been warned):

String regExp = "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>";
                                                       ^updated part

Since Проба 1 and Проба 2 also contain spaces, you need to add \\s to the character class in your pattern; \\w alone does not match whitespace.
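
As a minimal sketch of the updated pattern in context, the snippet below wraps it in a small helper and runs it on the example string from the question. The class name HeadingTextExtractor and the method extractTitles are hypothetical names chosen for illustration; the pattern and the UNICODE_CHARACTER_CLASS flag are taken from the question and this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class HeadingTextExtractor {

    // [\w\s]+ accepts word characters and whitespace; UNICODE_CHARACTER_CLASS
    // makes \w and \s follow Unicode rules, so Cyrillic letters count as \w.
    private static final Pattern HEADING_LINK = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Collects the text of capture group 1 for every match in the input.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = HEADING_LINK.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        // Prints one extracted link text per line.
        for (String title : extractTitles(example)) {
            System.out.println(title);
        }
    }
}
```

Note that this class still rejects link text containing punctuation such as a hyphen or comma, because those characters are neither \w nor \s.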

Answer 2

To discover the power of the dark side, you can try this pattern:

<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>

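No Unicode flag is needed for this pattern: [^<]+ matches any run of characters other than <, so Cyrillic letters, digits and spaces are all included. (Pattern.UNICODE_CASE only affects case-insensitive matching.)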
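
A rough sketch of this variant, assuming the same input shape as in the question. The class name LinkTextExtractor and the method extractLinkTexts are hypothetical, and the pattern is compiled without flags here on the basis that a negated class such as [^<] does not rely on \w or \s being Unicode-aware.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LinkTextExtractor {

    // [^<]+ takes everything up to the next '<', so spaces, digits,
    // punctuation and non-Latin letters are all captured.
    private static final Pattern HEADING_LINK = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>");

    // Returns the captured link text of every matching heading.
    static List<String> extractLinkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = HEADING_LINK.matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        // Prints one extracted link text per line.
        for (String text : extractLinkTexts(example)) {
            System.out.println(text);
        }
    }
}
```

Compared with [\w\s]+, this version also accepts link text that contains punctuation, at the cost of being less strict about what counts as a title.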

2 answers

solution1
4 ACCEPTED 2013-06-09 21:20:01

solution2
1 2013-06-09 21:25:55
