I would like to extract text from an HTML page using regular expressions. Here is my code:
String regExp="<h3 class=\"field-content\"><a[^>]*>(\\w+)</a></h3>";
Pattern regExpMatcher=Pattern.compile(regExp,Pattern.UNICODE_CHARACTER_CLASS);
String example="<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3><h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
Matcher m=regExpMatcher.matcher(example);
while (m.find()) {
    System.out.println(m.group(1));
}
I would like to get the values Проба 1 and Проба 2. However, I only get the first value, Проба 1. What is my problem?
It is blasphemy to use regex + HTML. But if you really want to be cursed then here it is (you have been warned):
String regExp = "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>";
(the updated part is the capture group: [\\w\\s]+)
Since Проба 1 and Проба 2 also contain spaces, you need to include \\s in your pattern.
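As a rough, self-contained sketch of that fix (the class name FieldContentTitles and the helper extract are made up for illustration; the pattern, flag, and sample HTML come from the question), something like this should print both values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
class FieldContentTitles {
    // UNICODE_CHARACTER_CLASS makes \w and \s use their Unicode definitions,
    // so \w also covers Cyrillic letters. Adding \s lets the group span spaces.
    private static final Pattern TITLE = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Collects the text of every matching link, in document order.
    static List<String> extract(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        for (String title : extract(example)) {
            System.out.println(title);
        }
    }
}
```

Note that the group still rejects titles containing punctuation such as commas or hyphens, since those are neither \w nor \s.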
To discover the power of the dark side, you can try this pattern:
<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>
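Don't forget to keep Pattern.UNICODE_CHARACTER_CLASS set, as in your code (not UNICODE_CASE, which only affects case-insensitive matching). It is what lets \\w match Cyrillic letters; the [^<]+ variant does not depend on it.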
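A minimal sketch of the [^<]+ variant, for comparison. The class name AnchorTextExtractor and the helper extract are hypothetical. The sketch compiles the pattern without any flags, on the assumption that a negated character class like [^<] already matches any character other than '<', Cyrillic letters, spaces, and punctuation included:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class AnchorTextExtractor {
    // [^<]+ captures everything up to the next tag, whatever the script or punctuation.
    private static final Pattern TITLE = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>");

    // Returns the link text of every matching heading, in document order.
    static List<String> extract(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        for (String title : extract(example)) {
            System.out.println(title);
        }
    }
}
```

This version is more tolerant of the link text, but it still breaks as soon as the markup changes (extra attributes on the h3, nested tags inside the link), which is the usual argument for an HTML parser.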