Can't get a match for regular expression in Java
This is an example of the string I want to extract data from:
<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada </a></span><br> </div>
And this is the regular expression I'm using for it:
"pelicula/([0-9]*)'>([\\w\\s]*)</a>"
I tested this regular expression in RegexPlanet and it worked; it gave me the expected result:
group(1) = 18313
group(2) = Subtitulada
But when I try to use that regular expression in Java, it doesn't match anything. Here's the code:
// inputLine and version are declared elsewhere in the surrounding code
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    version = matcher.group(2);
}
What's the problem? The regular expression is already tested, and in that same code I search for more patterns, but I'm having trouble with two of them (I'm showing just one here). Thank you in advance!
EDIT
I discovered the problem. If I view the source code of the page in a browser, it shows everything, but when I fetch the page from Java, I get different source code. Why? Because this page asks for your city so that it can show information for that city. I don't know if there's a workaround to actually access the information I want, but that's the cause.
Your regex is correct, but \\w does not match ñ: by default, Java's \\w covers only [a-zA-Z_0-9].
I changed the regex to
"pelicula/([0-9]*)'>(.*?)</a>"
and it matches both occurrences. Here I've used the reluctant *? quantifier to prevent .* from matching everything between the first <a> and the last </a>. See "What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?" for an explanation.
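To illustrate the difference between the greedy and reluctant forms, here is a minimal sketch. The class name QuantifierDemo, its helper linkTexts, and the shortened single-line sample are made up for this illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Shortened, single-line version of the sample markup (assumption for the demo).
    // The escape sequence in the first link text encodes the letter n with tilde.
    static final String SAMPLE =
        "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a><br>"
        + "<a href='/cartelera/pelicula/18313'>Subtitulada </a>";

    static final String GREEDY = "pelicula/([0-9]*)'>(.*)</a>";
    static final String RELUCTANT = "pelicula/([0-9]*)'>(.*?)</a>";

    // Collects group(2) of every match of regex in html.
    static List<String> linkTexts(String regex, String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(2));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Greedy: a single match whose group(2) runs up to the last </a>.
        System.out.println(linkTexts(GREEDY, SAMPLE).size());    // prints 1
        // Reluctant: one match per link.
        System.out.println(linkTexts(RELUCTANT, SAMPLE).size()); // prints 2
    }
}
```

With the greedy form, the second link is swallowed into group(2) of the first match, which is why the reluctant form is used here.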
@Bohemian is correct in pointing out that you might also need to enable the Pattern.DOTALL flag if the text inside <a> contains line breaks.
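As a side note on why the original character class fails on ñ, the sketch below compares the default behaviour of \\w with the Pattern.UNICODE_CHARACTER_CLASS flag (available since Java 7). This flag is an alternative that the answer above does not use; the class WordClassDemo and its helpers are hypothetical names for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordClassDemo {
    // The regular expression from the question, unchanged.
    static final String REGEX = "pelicula/([0-9]*)'>([\\w\\s]*)</a>";
    // One link from the sample; the escape sequence encodes the letter n with tilde.
    static final String SAMPLE = "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a>";

    // Returns group(2) of the first match, or null when nothing matches.
    static String linkText(String html, int flags) {
        Matcher matcher = Pattern.compile(REGEX, flags).matcher(html);
        return matcher.find() ? matcher.group(2) : null;
    }

    // Default behaviour: \w covers only [a-zA-Z_0-9].
    static String asciiLinkText(String html) {
        return linkText(html, 0);
    }

    // With the flag, \w follows the Unicode definition of a word character.
    static String unicodeLinkText(String html) {
        return linkText(html, Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        System.out.println(asciiLinkText(SAMPLE) == null);   // prints true
        System.out.println(unicodeLinkText(SAMPLE) != null); // prints true
    }
}
```

Whether to widen \\w this way or to switch to .*? depends on how strictly the link text should be validated; the reluctant dot accepts any characters, while the Unicode-aware class still restricts the text to word characters and whitespace.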
If your input spans several lines (i.e. it contains newline characters), you'll need to turn on "dot matches newline". This setting only affects the dot, so it applies to the (.*?) form of the pattern. There are two ways to do this:
Use the "dot matches newline" regex switch (?s) in your regex:
Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>(.*?)</a>");
or use the Pattern.DOTALL flag in the call to Pattern.compile():
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>(.*?)</a>", Pattern.DOTALL);
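A small sketch of the effect, assuming a pattern that actually contains a dot, since DOTALL changes only what . matches. The class DotAllDemo, the helper finds, and the multi-line input are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

final class DotAllDemo {
    static final String REGEX = "pelicula/([0-9]*)'>(.*?)</a>";
    // Hypothetical input: the link text is followed by a line break before </a>.
    static final String MULTILINE =
        "<a href='/cartelera/pelicula/18313'>Subtitulada\n</a>";

    // Reports whether regex matches anywhere in input, optionally with DOTALL.
    static boolean finds(String regex, boolean dotAll, String input) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without the flag, the dot stops at the newline, so nothing matches.
        System.out.println(finds(REGEX, false, MULTILINE));          // prints false
        // Pattern.DOTALL lets the dot cross the line break.
        System.out.println(finds(REGEX, true, MULTILINE));           // prints true
        // The inline switch (?s) has the same effect as the flag.
        System.out.println(finds("(?s)" + REGEX, false, MULTILINE)); // prints true
    }
}
```

If the page is read line by line, as the variable name inputLine suggests, each input never contains a newline and the flag makes no difference; it matters once the whole page is matched as a single string.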