Java regular expression matching HTML
Solution: this works:
String p = "<pre>[\\w\\W]*</pre>";
I want to match and capture the content enclosed by the <pre></pre> tags. I tried the following, but it does not work. What's wrong?
String p = "<pre>.*</pre>";
Matcher m = Pattern.compile(p, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE).matcher(input);
if (m.find()) {
    String g = m.group(0);
    System.out.println("g is " + g);
}
You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of ^ and $, while DOTALL is the one that lets . match line separators. You probably want to use a reluctant quantifier, too:
String p = "<pre>.*?</pre>";
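To make the difference concrete, here is a minimal sketch that combines DOTALL with the reluctant quantifier and adds a capturing group, so only the text between the tags is returned. The class name PreExtractor and the method extractPre are made up for this illustration, and the pattern assumes plain <pre> tags with no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
public class PreExtractor {

    // DOTALL lets '.' match line separators; '.*?' stops at the first </pre>.
    // Assumption: the tags carry no attributes (e.g. <pre class="x"> is not matched).
    private static final Pattern PRE =
            Pattern.compile("<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the inner text of every <pre>...</pre> block found in the input.
    static List<String> extractPre(String input) {
        List<String> contents = new ArrayList<>();
        Matcher m = PRE.matcher(input);
        while (m.find()) {
            contents.add(m.group(1)); // group(1) is the text between the tags
        }
        return contents;
    }

    public static void main(String[] args) {
        String input = "<pre>line one\nline two</pre><p>other</p><PRE>second block</PRE>";
        for (String content : extractPre(input)) {
            System.out.println("[" + content + "]");
        }
    }
}
```

With a greedy .* and DOTALL, the same input would produce a single match running from the first <pre> to the last </PRE>, which is why the reluctant form matters as soon as there is more than one block.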
Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.
Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
    System.out.println(element.text());
}
By the way, the parse() method can also take a URL or a File.
The reason I recommend Jsoup, by the way, is that it is the least verbose of all the HTML parsers I have tried. It not only provides JavaScript-like methods returning elements that implement Iterable, but it also supports jQuery-like selectors, which was a big plus for me.
String stringToSearch = "H1 FOUR H1 SCORE AND SEVEN YEARS AGO OUR FATHER...";
// the case-insensitive pattern we want to search for
Pattern p = Pattern.compile("H1", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(stringToSearch);
// see if we found a match
int count = 0;
while (m.find()) {
    count++;
}
System.out.println("H1 : "+count);