Matching HTML with a Java regular expression

Question

Solution (this works):

String p="<pre>[\\w\\W]*</pre>";

I want to match and capture the content enclosed by the <pre></pre> tag. I tried the following, but it does not work. What is wrong?

String p="<pre>.*</pre>";

        Matcher m=Pattern.compile(p,Pattern.MULTILINE|Pattern.CASE_INSENSITIVE).matcher(input);
        if(m.find()){
            String g=m.group(0);
            System.out.println("g is "+g);
        }

Answer 1

You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of ^ and $, while DOTALL is the one that lets . match line separators. You probably want to use a reluctant quantifier, too:

String p = "<pre>.*?</pre>";
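To illustrate the difference between the two flags, here is a minimal sketch. The class name PreTagRegexDemo, its method names, and the sample input are made up for the illustration, and the capturing group is an addition so that only the text between the tags is returned:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the question.
class PreTagRegexDemo {

    // DOTALL lets '.' match line terminators, the reluctant '.*?' stops at the
    // first </pre>, and the parentheses capture only the text between the tags.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the body of every <pre> block, in the order they appear.
    static List<String> extractPreBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = PRE_BLOCK.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1)); // group(0) would include the tags themselves
        }
        return bodies;
    }

    // The pattern and flags from the question: MULTILINE only affects ^ and $,
    // so '.' still cannot cross a line break.
    static boolean foundWithMultilineOnly(String html) {
        return Pattern.compile("<pre>.*</pre>",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE).matcher(html).find();
    }

    public static void main(String[] args) {
        // Assumed sample input: two <pre> blocks, each spanning two lines.
        String html = "<p>intro</p>\n<pre>line one\nline two</pre>\n<PRE>third\nfourth</PRE>";
        System.out.println("MULTILINE only finds a match: " + foundWithMultilineOnly(html));
        System.out.println("DOTALL + reluctant bodies: " + extractPreBodies(html));
    }
}
```

With a greedy .* and DOTALL, a single match would run from the first <pre> to the last </pre> in the input, which is why the reluctant form matters as soon as there is more than one block.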

Answer 2

Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.

Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
    System.out.println(element.text());
}

By the way, the parse() method can also take a URL or a File.

I recommend Jsoup because it is the least verbose of all the HTML parsers I have tried. It not only provides JavaScript-like methods that return elements implementing Iterable, but also supports jQuery-like selectors, which was a big plus for me.

Answer 3

String stringToSearch = "H1 FOUR H1 SCORE AND SEVEN YEARS AGO OUR FATHER...";

// the case-insensitive pattern we want to search for
Pattern p = Pattern.compile("H1", Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(stringToSearch);

// see if we found a match
int count = 0;
while (m.find())
    count++;

System.out.println("H1 : "+count);

3 answers

Answer 1
3 2010-05-08 00:30:27

Answer 2
3 2010-05-08 00:36:20

Answer 3
1 2015-07-26 19:00:44
