Extracting all HTML tags, including closing tags, from a file in Java without an external library such as Jsoup

Question

I have this code that takes in an HTML file, gets all the opening HTML tags, and prints them. I was wondering if there is a way to also include the closing tags. Right now it prints:

<html>
<head>
<title>
<body>
<table>
<p>
<a>
<p>
etc. etc.

I'm looking for it to print the closing tags as well:

<p>
<a>
</a>
</p>

Here's the code I have thus far:

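    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagPrinter {
        public static void main(String[] args) {
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader in = new BufferedReader(new FileReader("test.html"))) {
                String line;
                StringBuilder stringBuilder = new StringBuilder();
                while ((line = in.readLine()) != null) {
                    stringBuilder.append(line);
                }
                String pageContent = stringBuilder.toString();
                // (?!!) skips comments and doctype; (?!/) skips closing tags
                Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
                Matcher matcher = pattern.matcher(pageContent);
                while (matcher.find()) {
                    String tagName = matcher.group(1);
                    System.out.println("<" + tagName + ">");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }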

Edit: Is there a way to do it without using an external library like Jsoup?

Edit 2: I changed my Pattern.compile to <([a-zA-Z0-9]+|/[a-zA-Z0-9]+)(.*?)> and it worked. Thanks.
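
For anyone who wants to try the Edit 2 pattern in isolation, here is a minimal sketch. Only the regular expression comes from the post; the class name TagExtractor, the helper extractTags, and the sample string are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class TagExtractor {
    // Pattern from Edit 2: group 1 is the tag name, with a leading "/" for closing tags.
    // Comments and doctype declarations start with "<!" and therefore never match.
    private static final Pattern TAG =
            Pattern.compile("<([a-zA-Z0-9]+|/[a-zA-Z0-9]+)(.*?)>");

    // Returns every opening and closing tag in document order, with attributes stripped.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add("<" + matcher.group(1) + ">");
        }
        return tags;
    }

    public static void main(String[] args) {
        // Illustrative input; in the question this string would be the file contents.
        String sample = "<p class=\"x\"><a href=\"#\">link</a></p>";
        for (String tag : extractTags(sample)) {
            System.out.println(tag);
        }
    }
}
```

With the sample string above this prints <p>, <a>, </a> and </p>, one per line. Since this is plain regex matching rather than HTML parsing, a ">" inside an attribute value or "<" and ">" characters inside script text can produce wrong matches, so treat it as good enough for simple, well-formed files.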

Answer 1

If it's fine to use an external library, you can go with Jsoup as described here: Extract Tags from a html file using Jsoup

Answer 1
1 2014-12-08 19:31:10