
Extract and Parse HTML Table using Jsoup

How can I use Jsoup to extract the specification data from this website, separately for each row, e.g. Network -> Network Type, Battery, etc.?

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class mobilereviews {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                System.out.println(tds.get(0).text());   
            }
        }
    }
}

Here is one attempt at a solution to your problem:

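Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();

// Select only the specification table, then skip its leading formatting rows
for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        // Drop the rowspan cell that holds the category name (e.g. Network)
        Elements tds = row.select("td:not([rowspan])");
        if (tds.size() < 2) {
            continue; // not a label/value row
        }
        System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
    }
}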

Parsing HTML is tricky: if the HTML changes, your code needs to change as well.

You need to study the HTML markup to come up with your parsing rules first.

  • There are multiple tables in the HTML, so first filter on the correct one: table[id=phone_details]
  • The leading table rows contain only formatting markup, so skip them: tr:gt(2) (this keeps rows whose zero-based sibling index is greater than 2)
  • Every other row starts with a cell holding the general category for the content type; filter it out: td:not([rowspan])

For more complex options in the selector syntax, see http://jsoup.org/cookbook/extracting-data/selector-syntax
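The three rules can be tried offline before pointing them at the live site. The following is only a minimal sketch: the class name SelectorRulesSketch and the SAMPLE_HTML markup are hypothetical, written to imitate the layout described above rather than copied from the real page, so the real page may need different selectors.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorRulesSketch {

    // Hypothetical markup (an assumption, not the real page): a decoy table,
    // three formatting rows, then label/value rows with optional rowspan cells.
    static final String SAMPLE_HTML =
          "<table id=\"other\"><tr><td>ignored</td><td>ignored</td></tr></table>"
        + "<table id=\"phone_details\">"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td rowspan=\"2\">Network</td><td>Network Type</td><td>GSM</td></tr>"
        + "<tr><td>Data</td><td>GPRS</td></tr>"
        + "<tr><td rowspan=\"1\">Battery</td><td>Standby</td><td>350 h</td></tr>"
        + "</table>";

    // Applies the three selector rules and collects label -> value pairs in page order.
    static Map<String, String> extract(String html) {
        Map<String, String> specs = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        // Rule 1 and 2: the right table, then rows with zero-based index > 2
        for (Element row : doc.select("table[id=phone_details] tr:gt(2)")) {
            // Rule 3: ignore the category cell that carries a rowspan attribute
            Elements tds = row.select("td:not([rowspan])");
            if (tds.size() < 2) {
                continue; // skip rows that are not label/value pairs
            }
            specs.put(tds.get(0).text(), tds.get(1).text());
        }
        return specs;
    }

    public static void main(String[] args) {
        extract(SAMPLE_HTML).forEach((label, value) -> System.out.println(label + "->" + value));
    }
}
```

Collecting the pairs into a map instead of printing them directly makes it easier to look up a single specification later, for example extract(html).get("Network Type").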

XPath for the column names: //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong

XPath for the values: //*[@id="phone_details"]/tbody/tr[3]/td[3]

@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.

Replace the numbers (tr[N] / td[N]) with the appropriate values.
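To see what these expressions address when N changes, they can be evaluated with the XPath engine that ships with the JDK. This is only a sketch under assumptions: the class name XPathRowSketch and the SAMPLE_XML snippet are hypothetical, and the snippet is well-formed XML, which real-world HTML often is not. Note that XPath positions are 1-based, whereas Jsoup's :eq() and :gt() use 0-based indexes, so tr[3] corresponds to tr:eq(2).

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class XPathRowSketch {

    // Hypothetical, well-formed snippet (an assumption, not the real page)
    static final String SAMPLE_XML =
          "<html><body><table id=\"phone_details\"><tbody>"
        + "<tr><td>formatting</td></tr>"
        + "<tr><td>formatting</td></tr>"
        + "<tr><td>Network</td><td><strong>Network Type</strong></td><td>GSM</td></tr>"
        + "<tr><td>Battery</td><td><strong>Standby</strong></td><td>350 h</td></tr>"
        + "</tbody></table></body></html>";

    // Evaluates an XPath expression against the sample and returns its string value.
    static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE_XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    // Column name of the given 1-based row, using the expression from the answer.
    static String label(int row) {
        return evaluate("//*[@id=\"phone_details\"]/tbody/tr[" + row + "]/td[2]/strong");
    }

    // Value of the given 1-based row.
    static String value(int row) {
        return evaluate("//*[@id=\"phone_details\"]/tbody/tr[" + row + "]/td[3]");
    }

    public static void main(String[] args) {
        for (int row = 3; row <= 4; row++) {
            System.out.println(label(row) + "->" + value(row));
        }
    }
}
```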

Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can split the text on a delimiter, or read after N characters, to extract the data.
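Splitting such a text dump on a delimiter could look roughly like the sketch below. The dump format is an assumption: it supposes one "label: value" pair per line, which may not match what a given text-only browser actually produces, and the class name TextDumpSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TextDumpSketch {

    // Splits each line at the first occurrence of the delimiter into label -> value.
    // Lines without the delimiter (blank lines, headings) are skipped.
    static Map<String, String> parse(String text, String delimiter) {
        Map<String, String> specs = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int at = line.indexOf(delimiter);
            if (at < 0) {
                continue;
            }
            String label = line.substring(0, at).trim();
            String value = line.substring(at + delimiter.length()).trim();
            if (!label.isEmpty()) {
                specs.put(label, value);
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        // Hypothetical dump; the real text version of the page may be laid out differently
        String dump = "Network Type: GSM\nStandby: 350 h\n\nSpecifications";
        parse(dump, ":").forEach((label, value) -> System.out.println(label + "->" + value));
    }
}
```

This approach is more fragile than selecting cells from the DOM, because it depends on the text layout rather than on the table structure.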

This is how I get the data from an HTML table.

// Assumes doc is an already parsed org.jsoup.nodes.Document
String cadena = "";
org.jsoup.nodes.Element tablaRegistros = doc.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        // Alternative: pair the first two cells of each row instead
        // Elements tds = row.select("td");
        // cadena += tds.get(0).text() + "->" + tds.get(1).text() + " \n";
        cadena += column.text() + ",";
    }
    // One output line per table row
    cadena += "\n";
}

Here is a generic solution for extracting a table from an HTML page with Jsoup.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractTableDataUsingJSoup {

    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
    }

    public static void extractTableUsingJsoup(String url, String tableId){
        Document doc;
        try {
            // The URL must include the protocol (http:// or https://)
            doc = Jsoup.connect(url).get();

            // Pass the id of any table on any page and the code below prints its contents.
            // Store the extracted data in suitable data structures for further processing.
            Element table = doc.getElementById(tableId);
            if (table == null) {
                System.err.println("No element with id: " + tableId);
                return;
            }

            Elements tds = table.getElementsByTag("td");

            // getElementsByTag also returns cells of nested tables; check for nesting if such structure exists
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
