Extract all HTML tags, including closing tags, from file using Java without using external library like Jsoup

I have this code that takes in an HTML file, extracts all the opening HTML tags, and prints them. I was wondering whether there is a way to also include the closing tags. Right now it prints:

<html>
<head>
<title>
<body>
<table>
<p>
<a>
<p>
etc. etc.

I'd like it to print the closing tags as well, for example:

<p>
<a>
</a>
</p>

Here's the code I have thus far:

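    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagPrinter {
        public static void main(String[] args) {
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader in = new BufferedReader(new FileReader("test.html"))) {
                String line;
                StringBuilder stringBuilder = new StringBuilder();
                while ((line = in.readLine()) != null) {
                    // readLine() strips the line break; add a space so text on adjacent lines is not fused
                    stringBuilder.append(line).append(' ');
                }
                String pageContent = stringBuilder.toString();
                // Opening tags only: (?!!) skips comments and doctype, (?!/) skips closing tags
                Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
                Matcher matcher = pattern.matcher(pageContent);
                while (matcher.find()) {
                    String tagName = matcher.group(1);
                    System.out.println("<" + tagName + ">");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }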

Edit: Is there a way to do this without using an external library like Jsoup?

Edit 2: I changed the regex passed to Pattern.compile to <([a-zA-Z0-9]+|/[a-zA-Z0-9]+)(.*?)> and it worked. Thanks.
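For reference, the idea behind Edit 2 (letting the captured group start with an optional "/") can be sketched as a small standalone class. This is only an illustration, not the asker's exact code: the class name TagExtractor, the method extractTags, and the slightly tightened pattern (tag names must start with a letter, attributes are skipped with [^>]*) are assumptions made for the example. Being regex-based, it will mis-handle a ">" inside an attribute value and will also report tags that appear inside comments or scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this example.
class TagExtractor {
    // Group 1 captures an optional "/" plus the tag name; attributes up to the next ">" are ignored.
    // Comments and doctype declarations start with "<!" and therefore never match.
    private static final Pattern TAG = Pattern.compile("<(/?[a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Returns every opening and closing tag in document order, stripped of attributes.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add("<" + matcher.group(1) + ">");
        }
        return tags;
    }

    public static void main(String[] args) {
        // Prints <p>, <a>, </a>, </p>, one per line.
        for (String tag : extractTags("<p><a href=\"#\">link</a></p>")) {
            System.out.println(tag);
        }
    }
}
```

To use it with the file-reading code from the question, the pageContent string would be passed to extractTags in place of the inline sample.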

If it's fine to use an external library, you can go with Jsoup, as described in "Extract Tags from a html file using Jsoup".

