Extract all HTML tags, including closing tags, from a file using Java without an external library like Jsoup
I have this code that takes in an HTML file, gets all the opening HTML tags, and prints them. I was wondering if there is a way to also include the closing tags.
Right now it prints:
<html>
<head>
<title>
<body>
<table>
<p>
<a>
<p>
etc. etc.
I'd like it to print the closing tags as well:
<p>
<a>
</a>
</p>
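Here's the code I have so far:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new FileReader("test.html"))) {
    String line;
    StringBuilder stringBuilder = new StringBuilder();
    // Read the whole file into one string (line breaks are dropped)
    while ((line = in.readLine()) != null) {
        stringBuilder.append(line);
    }
    String pageContent = stringBuilder.toString();
    // Matches opening tags only: (?!/) rejects closing tags, (?!!) rejects comments and DOCTYPE
    Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
    Matcher matcher = pattern.matcher(pageContent);
    while (matcher.find()) {
        String tagName = matcher.group(1);
        System.out.println("<" + tagName + ">");
    }
} catch (IOException e) {
    e.printStackTrace();
}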
Edit: Is there a way to do it without using an external library like Jsoup?
Edit 2: I changed my Pattern.compile to this: <([a-zA-Z0-9]+|/[a-zA-Z0-9]+)(.*?)> and it worked. Thanks.
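For reference, the idea from Edit 2 (allowing an optional slash before the tag name) could look something like the sketch below. The class name TagExtractor, the method extractTags, and the sample HTML are made up for illustration, and the pattern is a slight variant of the one in Edit 2 rather than the exact same expression. Since it is only a regex, it will still pick up tags that appear inside comments or scripts, and it can be fooled by a > inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class TagExtractor {
    // Group 1: optional "/" for closing tags. Group 2: the tag name.
    // [^>]* skips attributes up to the next ">".
    // "<!DOCTYPE ...>" and "<!--" do not match here because "!" is
    // neither "/" nor a letter.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Returns every opening and closing tag in document order,
    // stripped down to its name, e.g. "<a>" and "</a>".
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add("<" + matcher.group(1) + matcher.group(2) + ">");
        }
        return tags;
    }

    public static void main(String[] args) {
        // Assumed sample input; in the question this would be the file content.
        String html = "<p class=\"x\"><a href=\"#\">link</a></p>";
        for (String tag : extractTags(html)) {
            System.out.println(tag);
        }
    }
}
```

To use it with the code from the question, pass pageContent to extractTags instead of the sample string.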
If it's fine to use an external library, you can go with Jsoup as described here: Extract Tags from a html file using Jsoup