Can't get a match for regular expression in Java

This is an example of the string I want to extract data from:

<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español  </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada  </a></span><br>          </div>

And this is the regular expression I'm using for it:

"pelicula/([0-9]*)'>([\\w\\s]*)</a>"

I tested this regular expression in RegexPlanet, and it gave me the expected result:

group(1) = 18313
group(2) = Subtitulada

But when I try to implement that regular expression in Java, it won't match anything. Here's the code:

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    // group(1) is the numeric id, group(2) is the link text
    version = matcher.group(2);
}

What's the problem? The regular expression has already been tested, and the same code searches for several other patterns; I'm only having trouble with two of them (I'm showing just one here). Thank you in advance!

EDIT

I discovered the problem. When I view the page source in a browser it shows everything, but when I fetch the page from Java I get different source code. The reason is that the page asks for your city so it can show information for that location. I don't know whether there's a workaround that would let me access the information I want, but that is the cause.

Your regex is correct, but by default Java's \\w matches only ASCII word characters ([a-zA-Z_0-9]), so it does not match ñ.
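To see this in isolation, here is a minimal sketch. The class name WordClassDemo and the helper linkText are made up for illustration, and the input is a shortened version of the first link from the question. It also shows one possible alternative that is not part of the fix used in this answer: compiling with Pattern.UNICODE_CHARACTER_CLASS (available since Java 7), which makes \\w match letters such as ñ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordClassDemo {
    // Hypothetical helper: returns the text captured by group 2,
    // or null when the pattern does not match anywhere in the input.
    static String linkText(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // \u00F1 is the letter ñ; written as an escape so the source file
        // encoding cannot change it.
        String html = "<a href='/cartelera/pelicula/18312'>Espa\u00F1ol  </a>";
        String regex = "pelicula/([0-9]*)'>([\\w\\s]*)</a>";

        // Default behaviour: \w is ASCII-only, so the match stops at ñ and fails.
        Pattern ascii = Pattern.compile(regex);
        // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII letters.
        Pattern unicode = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println(linkText(ascii, html));
        System.out.println(linkText(unicode, html));
    }
}
```

With the default flags the first call finds no match, while the second captures the link text including its trailing spaces.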

I changed the regex to

"pelicula/([0-9]*)'>(.*?)</a>"

and it matches both occurrences. Here I've used the reluctant *? quantifier to prevent .* from matching everything between the first <a> and the last </a>. See "What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?" for an explanation.
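The difference between the two quantifiers can be sketched as follows. The class name QuantifierDemo and the helper linkTexts are hypothetical, and the input is a trimmed-down version of the HTML from the question with the style attributes removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: collects group 2 of every match, with
    // surrounding whitespace trimmed.
    static List<String> linkTexts(String regex, String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            texts.add(m.group(2).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<a href='/cartelera/pelicula/18312'>Espa\u00F1ol  </a></span><br>"
                + "<span><a href='/cartelera/pelicula/18313'>Subtitulada  </a>";

        // Greedy: .* runs to the last </a>, so both links collapse into one match.
        List<String> greedy = linkTexts("pelicula/([0-9]*)'>(.*)</a>", html);
        // Reluctant: .*? stops at the nearest </a>, giving one match per link.
        List<String> reluctant = linkTexts("pelicula/([0-9]*)'>(.*?)</a>", html);

        System.out.println(greedy.size());
        System.out.println(reluctant);
    }
}
```

The greedy version yields a single match whose second group swallows the markup between the two links; the reluctant version yields the two link texts separately.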

@Bohemian is correct in pointing out that you might also need to enable the Pattern.DOTALL flag if the text inside <a> contains line breaks.

If your input spans several lines (i.e. it contains newline characters), you'll need to turn on "dot matches newline". This only affects the dot, so it matters for a pattern that uses . (such as the (.*?) version), not for [\\w\\s]*, where \\s already matches newlines.

There are two ways to do this:

Use the "dot matches newline" regex switch (?s) in your regex:

Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>(.*?)</a>");

or use the Pattern.DOTALL flag in the call to Pattern.compile():

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>(.*?)</a>", Pattern.DOTALL);
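A minimal sketch of what the flag changes, assuming the link text contains a line break and the pattern captures it with a dot-based group, (.*?). The class name DotAllDemo, the helper linkText, and the sample input are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: returns the trimmed text of group 2,
    // or null when the pattern does not match.
    static String linkText(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(2).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed input: the link text is followed by a newline before </a>.
        String html = "<a href='/cartelera/pelicula/18313'>Subtitulada\n  </a>";
        String regex = "pelicula/([0-9]*)'>(.*?)</a>";

        // Without DOTALL the dot stops at the newline, so there is no match.
        Pattern plain = Pattern.compile(regex);
        // Either form below lets the dot cross line breaks.
        Pattern withFlag = Pattern.compile(regex, Pattern.DOTALL);
        Pattern withSwitch = Pattern.compile("(?s)" + regex);

        System.out.println(linkText(plain, html));
        System.out.println(linkText(withFlag, html));
        System.out.println(linkText(withSwitch, html));
    }
}
```

The first call finds nothing, while the flag and the inline switch both capture the link text.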
