Help extracting text from an HTML tag with Java and regex
I would like to extract some text from an HTML file using regex. I am learning regex and still have trouble understanding it all. I have code that extracts all the text between <body> and </body>. Here it is:
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {
    public static void main(String[] args) throws IOException {
        String toMatch = readFile();
        //Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        try {
            // Open the file that is the first
            // command line parameter
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            //Read File Line By Line
            while (br.readLine() != null) {
                // Print the content on the console
                //System.out.println (strLine);
                strLine += br.readLine();
            }
            //Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { //Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}
It works fine like this, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.
So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"
I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?"
but it doesn't work and I don't understand why. There is only one table in the HTML file, but there is an occurrence of "table" in some JavaScript code: "...dataTables.js...". Could that be the reason for the mistake?
Thank you in advance for helping me.
EDIT: the HTML text to extract from is something like:
<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>
What I would like to extract is anything between <table class="claroTable"> and </table>.
Here's how you can do it with the JSoup parser:
File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();
Yes, you can also somehow do it with regex, but it will never be this easy.
Update: The main problem with your regex pattern is that you are missing the DOTALL flag. Without it, . does not match line terminators, so the pattern cannot span multiple lines:
Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
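To see the effect of the flag in isolation, here is a minimal sketch that runs the same body pattern against a small multi-line string, once without flags and once with Pattern.DOTALL. The class name DotallDemo, the helper matchesWhole, and the sample HTML are made up for illustration; only Pattern, Matcher.matches() and the DOTALL flag come from the standard library.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // Same pattern as in the question, kept verbatim.
    static final String REGEX = ".*?<body.*?>(.*?)</body>.*?";

    /** True if the WHOLE input matches REGEX compiled with the given flags. */
    static boolean matchesWhole(String input, int flags) {
        return Pattern.compile(REGEX, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample input containing line breaks.
        String html = "<html>\n<body>\nhello\n</body>\n</html>";

        // Without DOTALL, '.' stops at every '\n', so matches() fails.
        System.out.println(matchesWhole(html, 0));              // prints false

        // With DOTALL, '.' also matches '\n', so the pattern spans lines.
        System.out.println(matchesWhole(html, Pattern.DOTALL)); // prints true
    }
}
```

Note that matches() requires the entire input to match, which is why the pattern needs the leading and trailing .*? in the first place.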
And if you just want the contents of the specified table tag, you can do something like this:
String tableTag =
Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*?",Pattern.DOTALL)
.matcher(html)
.replaceFirst("$1");
(Updated: now returns the contents of the table tag only, not the table tag itself)
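If you would rather keep a Matcher than use replaceFirst, a variant of the same idea is to call find() instead of matches(). This is only a sketch under assumptions: the class name TableExtractor, the method extract, and the sample HTML are invented here, and the pattern assumes class="claroTable" is written exactly like that directly after <table. Because find() searches anywhere in the input, the surrounding .*? parts are not needed, and the "dataTables.js" text in a script tag does not interfere since it does not contain "<table".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableExtractor {
    // DOTALL lets '.' match line terminators, so the group can span lines.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\s+class=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    /** Returns the text inside the first claroTable, or null if there is none. */
    static String extract(String html) {
        Matcher m = TABLE.matcher(html);
        // find() looks for the pattern anywhere in the input,
        // unlike matches(), which must consume the whole input.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample resembling the HTML described in the question.
        String html = "<body>\n"
                + "<script src=\"dataTables.js\"></script>\n"
                + "<table class=\"claroTable\">\n"
                + "<tr><td>some data</td></tr>\n"
                + "</table>\n"
                + "</body>";
        System.out.println(extract(html));
    }
}
```

The lazy (.*?) stops at the first </table>, so this would cut the result short if the table contained nested tables. That is one of the cases where a real parser is the safer choice.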
As stated, this is a bad place to use regex. Only use regex when you actually need to, so basically try to stay away from it if you can. Take a look at this post about parsers, though:
How to parse and modify HTML file in Java
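Whichever approach you pick, the readFile method in the question is worth a second look. Its loop calls br.readLine() twice per iteration: the call in the while condition reads a line and throws it away, and only the second call is appended. Every other line of the file is therefore dropped, and because strLine starts as null, the result begins with the text "null". If the <table class="claroTable"> line happens to be one of the dropped lines, no pattern can find it, which may be why the body pattern worked and the table pattern did not. A minimal sketch of a reader that keeps every line follows. The class name FileText is made up, and the ISO-8859-1 charset is an assumption borrowed from the JSoup example above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileText {
    /** Reads every line of the file into one String, keeping line breaks. */
    static String readFile(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Charset is an assumption; use whatever encoding the file really has.
        try (BufferedReader br =
                 Files.newBufferedReader(path, StandardCharsets.ISO_8859_1)) {
            String line;
            // Read each line exactly once and append that same line.
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "user.html" is the file name used in the question.
        System.out.println(readFile(Paths.get("user.html")));
    }
}
```

Since this version keeps the line breaks, a regex applied to its result does need Pattern.DOTALL to match across lines.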