Extract data between tags in an HTML page using HtmlUnit

I'm trying to extract data from a web page using HtmlUnit. I have already done this by converting the HtmlPage to text and then extracting the data from that text with regular expressions. I have also managed to extract data from HTML tables by selecting on the class attribute.

As a learning exercise, I now want to do the whole extraction with HtmlUnit alone, covering the same requirement I previously handled with regular expressions. I can't figure out how to extract the data inside tags as key-value pairs.

Here is the sample HTML:

<div class="top_red_bar">
    <div id="site-breadcrumbs">
        <a href="/admin/index.jsp" title="Home">Home</a>
        &#124;
        <a href="/admin/queues.jsp" title="Queues">Queues</a>
        &#124;
        <a href="/admin/topics.jsp" title="Topics">Topics</a>
        &#124;
        <a href="/admin/subscribers.jsp" title="Subscribers">Subscribers</a>
        &#124;
        <a href="/admin/connections.jsp" title="Connections">Connections</a>
        &#124;
        <a href="/admin/network.jsp" title="Network">Network</a>
        &#124;
         <a href="/admin/scheduled.jsp" title="Scheduled">Scheduled</a>
        &#124;
        <a href="/admin/send.jsp"
           title="Send">Send</a>
    </div>
    <div id="site-quicklinks"><P>
        <a href="http://activemq.apache.org/support.html"
           title="Get help and support using Apache ActiveMQ">Support</a></p>
    </div>
</div>

<table border="0">
<tbody>
    <tr>
        <td valign="top" width="100%" style="overflow:hidden;">
            <div class="body-content">


<h2>Welcome!</h2>

<p>
Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)
</p>

<p>
You can find more information about Apache ActiveMQ on the <a href="http://activemq.apache.org/">Apache ActiveMQ Site</a>
</p>

<h2>Broker</h2>


<table>
    <tr>
        <td>Name</td>
        <td><b>localhost</b></td>
    </tr>
    <tr>
        <td>Version</td>
        <td><b>5.13.3</b></td>
    </tr>
    <tr>
        <td>ID</td>
        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>
    </tr>
    <tr>
        <td>Uptime</td>
        <td><b>17 days 13 hours</b></td>
    </tr>
    <tr>
        <td>Store percent used</td>
        <td><b>19</b></td>
    </tr>
    <tr>
        <td>Memory percent used</td>
        <td><b>0</b></td>
    </tr>
    <tr>
        <td>Temp percent used</td>
        <td><b>0</b></td>
    </tr>
</table>

I want to extract the data between the table tags. Expected output:

Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0

How can this be achieved? I want to know which HtmlUnit methods to use.

These are the steps I followed (not the only solution):

  1. Parse the string with the parseHtml method, using a dummy URL.
  2. Get the second table with XPath.
  3. Iterate with two nested loops (a for loop over the rows and an iterator over the cells, so that the separator is appended correctly).

ExtractTableData:

import java.net.URL;

import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HTMLParser;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow.CellIterator;


public class ExtractTableData {

    public static void main(String[] args) throws Exception {

        String html = "<div class=\"top_red_bar\">\n" + "                        <div id=\"site-breadcrumbs\">\n"
                + "                            <a href=\"/admin/index.jsp\" title=\"Home\">Home</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/queues.jsp\" title=\"Queues\">Queues</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/topics.jsp\" title=\"Topics\">Topics</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/subscribers.jsp\" title=\"Subscribers\">Subscribers</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/connections.jsp\" title=\"Connections\">Connections</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/network.jsp\" title=\"Network\">Network</a>\n"
                + "                            &#124;\n"
                + "                             <a href=\"/admin/scheduled.jsp\" title=\"Scheduled\">Scheduled</a>\n"
                + "                            &#124;\n" + "                            <a href=\"/admin/send.jsp\"\n"
                + "                               title=\"Send\">Send</a>\n" + "                        </div>\n"
                + "                        <div id=\"site-quicklinks\"><P>\n"
                + "                            <a href=\"http://activemq.apache.org/support.html\"\n"
                + "                               title=\"Get help and support using Apache ActiveMQ\">Support</a></p>\n"
                + "                        </div>\n" + "                    </div>\n" + "\n"
                + "                    <table border=\"0\">\n" + "                        <tbody>\n"
                + "                            <tr>\n"
                + "                                <td valign=\"top\" width=\"100%\" style=\"overflow:hidden;\">\n"
                + "                                    <div class=\"body-content\">\n" + "\n" + "\n"
                + "<h2>Welcome!</h2>\n" + "\n" + "<p>\n"
                + "Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)\n"
                + "</p>\n" + "\n" + "<p>\n"
                + "You can find more information about Apache ActiveMQ on the <a href=\"http://activemq.apache.org/\">Apache ActiveMQ Site</a>\n"
                + "</p>\n" + "\n" + "<h2>Broker</h2>\n" + "\n" + "\n" + "<table>\n" + "    <tr>\n"
                + "        <td>Name</td>\n" + "        <td><b>localhost</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>Version</td>\n" + "        <td><b>5.13.3</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>ID</td>\n" + "        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>\n"
                + "    </tr>\n" + "    <tr>\n" + "        <td>Uptime</td>\n"
                + "        <td><b>17 days 13 hours</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>Store percent used</td>\n" + "        <td><b>19</b></td>\n" + "    </tr>\n"
                + "    <tr>\n" + "        <td>Memory percent used</td>\n" + "        <td><b>0</b></td>\n"
                + "    </tr>\n" + "    <tr>\n" + "        <td>Temp percent used</td>\n" + "        <td><b>0</b></td>\n"
                + "    </tr>\n" + "</table>";
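        WebClient webClient = new WebClient();
        // Parse the HTML string directly; the dummy URL only serves as the page's base URL.
        HtmlPage page = HTMLParser.parseHtml(new StringWebResponse(html, new URL("http://dummy.url.for.parsing.com/")),
                webClient.getCurrentWindow());

        // "//table" matches both tables in document order; index 1 is the Broker table.
        final HtmlTable table = (HtmlTable) page.getByXPath("//table").get(1);

        for (final HtmlTableRow row : table.getRows()) {

            CellIterator cellIterator = row.getCellIterator();

            // Print the first cell as-is, then prefix every further cell with ":"
            // so that no separator is left dangling at the end of the line.
            if (cellIterator.hasNext()) {
                System.out.print(cellIterator.next().asText());
                while (cellIterator.hasNext()) {
                    System.out.print(":" + cellIterator.next().asText());
                }
            }
            System.out.println();
        }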

    }

}

Output:

Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0
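
Since the question asks for key-value pairs rather than printed lines, the same row-by-row idea can also fill a Map. The sketch below is not HtmlUnit code: it uses only the JDK's built-in XML parser and XPath so that it runs without extra dependencies, and it therefore assumes the fragment is well-formed XML (real-world HTML often is not, which is where HtmlUnit's parser helps). The class name TableToMapSketch, the helper extractPairs, and the shortened sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableToMapSketch {

    // Hypothetical helper: for every row matched by rowXPath, maps the text of
    // the first cell to the text of the second cell.
    public static Map<String, String> extractPairs(String xhtml, String rowXPath) throws Exception {
        // Assumption: the input is well-formed XML, so the JDK parser accepts it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate(rowXPath, doc, XPathConstants.NODESET);

        // LinkedHashMap keeps the rows in their original order.
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // normalize-space() also picks up text nested in <b> and trims whitespace.
            String key = xpath.evaluate("normalize-space(td[1])", row);
            String value = xpath.evaluate("normalize-space(td[2])", row);
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        // Shortened, well-formed stand-in for the page: a layout table, then the Broker table.
        String xhtml = "<div>"
                + "<table border=\"0\"><tr><td>layout</td></tr></table>"
                + "<table>"
                + "<tr><td>Name</td><td><b>localhost</b></td></tr>"
                + "<tr><td>Version</td><td><b>5.13.3</b></td></tr>"
                + "<tr><td>Uptime</td><td><b>17 days 13 hours</b></td></tr>"
                + "</table>"
                + "</div>";

        // "(//table)[2]" selects the second table in document order.
        Map<String, String> pairs = extractPairs(xhtml, "(//table)[2]/tr");
        pairs.forEach((key, value) -> System.out.println(key + ":" + value));
    }
}
```

With the shortened sample this prints Name:localhost, Version:5.13.3 and Uptime:17 days 13 hours, one per line. In the HtmlUnit version above, the same effect could presumably be had by putting the first and second cell texts of each HtmlTableRow into a LinkedHashMap instead of printing them.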
