Help using Java and Regex to extract text from HTML tags

Question

I would like to extract some text from an HTML file using regex. I am learning regex and still have trouble understanding it all. I have code that extracts all the text between <body> and </body>. Here it is:

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {

    public static void main(String[] args) throws IOException {
        String toMatch = readFile();
        //Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        try {
            // Open the input file
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            // Read file line by line
            while (br.readLine() != null) {
                // Print the content on the console
                //System.out.println (strLine);
                strLine += br.readLine();
            }
            // Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}

It works fine like this, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but neither works and I don't understand why. There is only one table in the HTML file, but "table" also occurs in some JavaScript code ("...dataTables.js..."). Could that be the reason for the failure?

Thank you in advance for helping me.

EDIT: the HTML text to extract is something like:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I would like to extract is anything between <table class="claroTable"> and </table>.

Answer 1

Here's how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can also do it with regex, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
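To see what the flag changes: by default `.` does not match line terminators, so a pattern applied with `matches()` fails as soon as the input spans several lines. A minimal sketch, assuming a small inline sample in place of the real file (the class name `DotAllDemo`, the method `bodyContent`, and the sample string are illustrative and not from the original post):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {

    // Same pattern as above; only the flags differ between the two calls.
    private static final String REGEX = ".*?<body.*?>(.*?)</body>.*?";

    /** Returns the captured body content, or null when the whole input does not match. */
    public static String bodyContent(String html, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(html);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical multi-line input standing in for the HTML file.
        String html = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>";

        // Without DOTALL, '.' cannot cross the line breaks, so matches() fails.
        System.out.println(bodyContent(html, 0));
        // With DOTALL, the group captures everything between the body tags.
        System.out.println(bodyContent(html, Pattern.DOTALL));
    }
}
```

The first call returns null and the second returns the text between the tags, including the surrounding line breaks.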

And if you just want the specified table tag with its contents, you can do something like this:

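// The trailing .* is greedy so the match also consumes everything after </table>;
// with a lazy .*? there, replaceFirst would leave the rest of the document in place.
String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");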

(Updated: this now returns only the contents of the table tag, not the table tag itself.)
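A fuller sketch of the same idea, under a few assumptions. It uses `find()` with `group(1)` instead of `replaceFirst`, so a missing table gives an empty result rather than the unchanged input, and it reads the file in one call so that line breaks are preserved. Note that the `readFile()` loop in the question calls `readLine()` twice per iteration, which discards every other line, and it starts from `null`, which prepends the text "null" to the result. The class name `TableExtractor`, its method names, the sample markup, and the ISO-8859-1 charset (borrowed from the JSoup example) are illustrative choices only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableExtractor {

    // DOTALL lets '.' cross line breaks; the lazy group stops at the first </table>.
    // Requiring whitespace after "<table" keeps strings like "dataTables.js" from matching.
    private static final Pattern CLARO_TABLE = Pattern.compile(
            "<table\\s[^>]*class=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    /** Reads the whole file as one string, keeping its line terminators. */
    public static String readFile(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1);
    }

    /** Returns the inner content of the first claroTable table, if there is one. */
    public static Optional<String> extractTable(String html) {
        Matcher m = CLARO_TABLE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Pass a file path (e.g. user.html from the question) as the first argument;
        // without one, a small hypothetical sample is used instead.
        String html = args.length > 0
                ? readFile(Paths.get(args[0]))
                : "<body>\n<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html).orElse("(no claroTable table found)"));
    }
}
```

This still shares the usual limits of regex on HTML: a nested table would end the match at the inner `</table>`, and a class attribute with several values would not match, which is where a parser such as JSoup is the safer choice.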

Answer 2

As stated, this is a bad place to use regex. Only use regex when you actually need to, so try to stay away from it if you can. For parsers, take a look at this post:

How to parse and modify HTML file in Java


Answer 1: score 6, accepted, 2011-08-29 09:24:48

Answer 2: score 0, 2011-08-29 09:20:05
