Help extracting text from html tag with Java and Regex

I would like to extract some text from an HTML file using regex. I am learning regex and still have trouble understanding all of it. I have code which extracts all the text between <body> and </body>; here it is:

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {

    public static void main(String[] args) throws IOException {

        String toMatch = readFile();
        //Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        // matches() succeeds only if the pattern covers the entire input
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {

        try {
            // Open the HTML file
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            //Read File Line By Line
            while (br.readLine() != null) {
                // Print the content on the console
                //System.out.println (strLine);
                strLine += br.readLine();
            }
            //Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { //Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}

Well, it works fine like this, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but it doesn't work and I don't understand why. There is only one table in the HTML file, but there is an occurrence of "table" in some JavaScript code: "...dataTables.js...". Could that be the reason for the mistake?

Thank you in advance for helping me.

EDIT: the HTML text to extract is something like:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I would like to extract is anything between <table class="claroTable"> and </table>.

Here's how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can also do it with regex somehow, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag. Without it, "." does not match line terminators, so the pattern cannot span multiple lines:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
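
To make the effect of the flag concrete, here is a minimal sketch that runs the same body pattern against a small multi-line string with and without Pattern.DOTALL. The class name DotAllDemo, the helper matchWhole and the sample HTML are made up for illustration; they are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {

    // Hypothetical helper: returns group 1 if the pattern matches the
    // entire input (same semantics as matcher.matches()), otherwise null.
    static String matchWhole(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input containing line breaks (an assumption for the demo).
        String html = "<html>\n<body>\nline one\nline two\n</body>\n</html>";
        String regex = ".*?<body.*?>(.*?)</body>.*?";

        // Without DOTALL, "." stops at every '\n', so matches() fails.
        System.out.println(matchWhole(regex, 0, html));

        // With DOTALL, "." also matches '\n' and the body content is captured.
        System.out.println(matchWhole(regex, Pattern.DOTALL, html));
    }
}
```

Note that the flag only matters when the input string actually contains line terminators.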

And if you just want the contents of the specified table tag, you can do something like this:

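// The trailing ".*" is greedy so the rest of the document is consumed and dropped;
// if nothing matches, replaceFirst returns html unchanged.
String tableTag =
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");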

(Updated: now returns the contents of the table tag only, not the table tag itself)
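
Separately from the regex, the readFile method in the question calls br.readLine() twice per loop iteration (once in the while condition, once in the body), so every other line is discarded, and because strLine starts out as null the result begins with the literal text "null". If the line holding the table tag is one of the discarded ones, no pattern can find it. Below is a rough sketch of an alternative that reads the file in one call and uses find() instead of matches(), so the pattern does not need to cover the whole document. The class name TableExtractor, the method names and the ISO-8859-1 charset are assumptions for illustration, and the pattern assumes there are no nested tables.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableExtractor {

    // No leading or trailing ".*?" is needed because find() scans the input
    // for the first occurrence. DOTALL lets "." cross line breaks.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    // Returns the text between the opening and closing table tags,
    // or null if no such table is found.
    static String extractTable(String html) {
        Matcher m = TABLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Reads the whole file, line breaks included.
    // The charset is an assumption; use whatever the file is encoded in.
    static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        String html = readAll(Paths.get("user.html"));
        System.out.println(extractTable(html));
    }
}
```

As for the "dataTables.js" reference, this pattern requires a literal "<table" followed by the class attribute, so a script file name containing "Tables" should not interfere with it.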

As stated, this is a bad place to use regex. Only use regex when you actually need to; if you can avoid it, do. For parsers, take a look at this post:

How to parse and modify HTML file in Java
