Need help on how to make regex and why this one doesn't work

Question

This is the text:

 <div class="center-content"> <h2> <a href="https://lapiedradesisifo.com/2019/11/04/la-silenciosa-linea-del-idioma-no-hablado/" class="l:3207185" > La silenciosa línea del idioma no hablado </a>

My code:

Pattern p = Pattern.compile("<div class=\"center-content\"> *<h2> <a.{10,200} >(.{50,200})</a>");
Matcher m = p.matcher(text);

StringBuilder sb = new StringBuilder();
while(m.find()){
    sb.append(m.group(1) + "\n");
}

System.out.println(sb.toString());

This is what I expected to be printed on the screen:

"La silenciosa línea del idioma no hablado"

But nothing is printed. I really don't know why, because I've tried it with similar examples and it works.

I gotta be honest, I got this regex with some help and I still don't really understand how it works, would really appreciate some help with this one.

Answer 1

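The "." does not match newlines by default. The HTML you want to parse seems to contain newlines.

You can use Pattern.compile("pattern", Pattern.DOTALL) to make "." match newlines too. Even with that, your regex will not match, for two reasons. First, " La silenciosa línea del idioma no hablado " is shorter than 50 characters, so (.{50,200}) cannot match it. Second, if there is a newline after <div class="center-content">, the " *" will not match it, because it only matches spaces (\s* would match any whitespace). An online regex tester helps to find this kind of problem.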

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#DOTALL
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile(java.lang.String,%20int)
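To make the effect of Pattern.DOTALL concrete, here is a minimal sketch. It rests on several assumptions that are not in the question: the positions of the line breaks in the sample HTML are invented, the class and method names (DotallDemo, firstTitle) are hypothetical, the " *" parts are swapped for \s*, and the minimum length of the captured group is lowered from 50 to 1 so that the short title can match at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Sample input. The two line breaks are an assumption about the real page.
    static final String HTML =
        "<div class=\"center-content\">\n"
        + "<h2> <a href=\"https://lapiedradesisifo.com/2019/11/04/la-silenciosa-linea-del-idioma-no-hablado/\"\n"
        + " class=\"l:3207185\" > La silenciosa línea del idioma no hablado </a>";

    // Returns the trimmed text of the first matching link, or null if nothing matches.
    // flags is passed straight to Pattern.compile (0 means no flags).
    static String firstTitle(String html, int flags) {
        Pattern p = Pattern.compile(
            "<div class=\"center-content\">\\s*<h2>\\s*<a.{10,200} >(.{1,200}?)</a>", flags);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Without DOTALL, "." stops at the line break inside the <a> tag: no match.
        System.out.println(firstTitle(HTML, 0));
        // With DOTALL, "." also matches the line break, so the title is captured.
        System.out.println(firstTitle(HTML, Pattern.DOTALL));
    }
}
```

The first call returns null and the second returns the title, which shows that the flag matters only when the input really contains line breaks inside the part covered by ".".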

Answer 2

As Mike pointed out in the comment, use a proper HTML parser for processing HTML input. However, if you are interested in how your regex works, I'll try to briefly describe it.

Current pattern

Your current pattern works as follows

<div class=\"center-content\"> - matches literally <div class="center-content">

 *<h2> - matches the space character between zero and unlimited times, followed by <h2>

<a.{10,200} > - matches <a followed by any character (except a newline) between 10 and 200 times, followed by a space and the character >

(.{50,200}) - this one matches any character between 50 and 200 times and captures it into a group. This is, by the way, what you access in your code by calling m.group(1). Note that the text in your example is shorter than 50 characters, so this part cannot match it.

</a> - matches </a> literally
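The breakdown above can be checked with a small sketch that runs the original pattern against the single-line text from the question, and then a relaxed variant. The relaxed pattern (minimum length 1 instead of 50) and the names QuantifierDemo and capture are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // The text from the question, on a single line.
    static final String TEXT =
        " <div class=\"center-content\"> <h2> <a href=\"https://lapiedradesisifo.com/2019/11/04/la-silenciosa-linea-del-idioma-no-hablado/\" class=\"l:3207185\" > La silenciosa línea del idioma no hablado </a>";

    // The pattern from the question, unchanged.
    static final String ORIGINAL =
        "<div class=\"center-content\"> *<h2> <a.{10,200} >(.{50,200})</a>";

    // Same pattern, but the captured group may be as short as one character (assumption).
    static final String RELAXED =
        "<div class=\"center-content\"> *<h2> <a.{10,200} >(.{1,200})</a>";

    // Returns group 1 of the first match, or null if the regex does not match.
    static String capture(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The text between " >" and "</a>" is shorter than 50 characters: no match.
        System.out.println(capture(ORIGINAL, TEXT));
        // With the lower bound relaxed, the group captures the title (with surrounding spaces).
        String title = capture(RELAXED, TEXT);
        System.out.println("[" + title + "] length=" + title.length());
    }
}
```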

Simplified version

However, if your goal is just to capture the text wrapped in an <a> element, you can simplify your regex to <a\s+href=.*?>(.*?)</a> which works as follows:

<a\s+href= - matches <a, followed by one or more whitespace characters, followed by href=

.*?> - matches the rest of the opening tag, including the URL (any character between 0 and unlimited times, as few times as possible), followed by >

(.*?) - captures anything in between > and </a> (as few times as possible) - call .group(1) to get it

</a> - matches </a>
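A minimal sketch of how the simplified pattern could be used in Java follows. The class and method names (SimplifiedRegexDemo, linkTexts) are hypothetical, and Pattern.DOTALL and the trim() call are additions of this sketch, included on the assumption that the real input may contain line breaks and padding spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimplifiedRegexDemo {
    // DOTALL is an addition here so that "." also matches line breaks.
    private static final Pattern LINK =
        Pattern.compile("<a\\s+href=.*?>(.*?)</a>", Pattern.DOTALL);

    // Collects the trimmed text of every <a href=...>...</a> found in the input.
    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String text =
            " <div class=\"center-content\"> <h2> <a href=\"https://lapiedradesisifo.com/2019/11/04/la-silenciosa-linea-del-idioma-no-hablado/\" class=\"l:3207185\" > La silenciosa línea del idioma no hablado </a>";
        for (String title : linkTexts(text)) {
            System.out.println(title);
        }
    }
}
```

Because both quantifiers are lazy, each match stops at the first > and the first </a> it finds, so several links in the same input are returned separately instead of being merged into one long match.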
