Unable to get regular expression matches in Java

Question

This is the format/example of the string I want to extract data from:

<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español  </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada  </a></span><br>          </div>

And this is the regular expression I'm using for it:

"pelicula/([0-9]*)'>([\\w\\s]*)</a>"

I tested this regular expression in RegexPlanet, and it gave me the expected result:

group(1) = 18313
group(2) = Subtitulada

But when I try to use that regular expression in Java, it won't match anything. Here's the code:

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    version = matcher.group(2);
}

What's the problem? The regular expression has already been tested, and in that same code I search for more patterns, but I'm having trouble with two of them (I'm showing just one here). Thank you in advance!

EDIT

I discovered the problem... If I check the source code of the page in the browser it shows everything, but when I try to consume it from Java, I get different source code. Why? Because this page asks for your city so it can show information for that city. I don't know if there's a workaround to actually access the information I want, but that's the cause.

Answer 1

Your regex is correct, but \\w does not match ñ: by default, Java's \\w covers only [a-zA-Z_0-9].

I changed the regex to

"pelicula/([0-9]*)'>(.*?)</a>"

and it matches both occurrences. Here I've used the reluctant *? operator to prevent .* from matching all characters between the first <a> and the last </a>. See "What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?" for an explanation.

@Bohemian is correct in pointing out that you might need to enable the Pattern.DOTALL flag as well if the text inside <a> contains line breaks.
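To see the difference concretely, here is a minimal sketch that runs both patterns against a shortened copy of the HTML from the question. The class name RegexMatchDemo and the helper findVersions are made up for illustration. The third call is not part of the answer above: it is an alternative that assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag makes \\w cover non-ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMatchDemo {
    // Shortened version of the HTML from the question.
    // The unicode escape in "Espa...ol" is the letter n-with-tilde.
    static final String SAMPLE =
        "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol  </a><br>"
      + "<a href='/cartelera/pelicula/18313'>Subtitulada  </a>";

    // Original pattern from the question.
    static final String WORD_CLASS = "pelicula/([0-9]*)'>([\\w\\s]*)</a>";
    // Pattern from this answer, using a reluctant quantifier.
    static final String RELUCTANT = "pelicula/([0-9]*)'>(.*?)</a>";

    // Hypothetical helper: returns "id=text" for every match, text trimmed.
    static List<String> findVersions(String regex, int flags) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(SAMPLE);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Default \w is ASCII-only, so only the 18313 entry is found.
        System.out.println(findVersions(WORD_CLASS, 0));
        // The reluctant dot does not care about character classes: both entries.
        System.out.println(findVersions(RELUCTANT, 0));
        // Assumed alternative: Unicode-aware \w also finds both entries.
        System.out.println(findVersions(WORD_CLASS, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

This also explains why the RegexPlanet test in the question reported only 18313: the first link fails at the ñ and the matcher simply moves on to the second one.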

Answer 2

If your input spans several lines (i.e. it contains newline characters), you'll need to turn on "dot matches newline". Note that this switch only changes how . behaves; \\s already matches newlines on its own.

There are two ways to do this:

Use the "dot matches newline" regex switch (?s) in your regex:

Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>");

or use the Pattern.DOTALL flag in the call to Pattern.compile():

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL);
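As a rough illustration of when the switch matters, the sketch below uses the (.*?) variant of the pattern from Answer 1, because the flag has a visible effect only on patterns that contain a dot. The class name DotAllDemo, the helper countMatches, and the sample string with a line break inside the link text are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical input: the link text contains a line break before </a>.
    static final String MULTILINE =
        "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol\n  </a>";

    // Dot-based pattern from Answer 1.
    static final String REGEX = "pelicula/([0-9]*)'>(.*?)</a>";

    // Hypothetical helper: counts how many times the regex matches the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Without the switch, '.' stops at the newline, so nothing matches.
        System.out.println(countMatches(REGEX, 0, MULTILINE));              // 0
        // Pattern.DOTALL lets '.' cross the newline.
        System.out.println(countMatches(REGEX, Pattern.DOTALL, MULTILINE)); // 1
        // The inline (?s) switch is equivalent.
        System.out.println(countMatches("(?s)" + REGEX, 0, MULTILINE));     // 1
    }
}
```

Either form works; the inline (?s) is handy when the regex is stored as a plain string, while the flag keeps the pattern itself easier to read.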
