Extract and Parse HTML Table using Jsoup

How can I use Jsoup to extract the specification data from this website separately for each row, e.g. Network -> Network Type, Battery, etc.?

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class mobilereviews {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                System.out.println(tds.get(0).text());   
            }
        }
    }
}

Here is an attempt at a solution to your problem:

Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();

for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        Elements tds = row.select("td:not([rowspan])");
        // Skip rows that do not have both a label cell and a value cell
        if (tds.size() >= 2) {
            System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
        }
    }
}

Parsing HTML is tricky, and if the HTML changes, your code needs to change as well.

You need to study the HTML markup first to come up with your parsing rules.

  • There are multiple tables in the HTML, so first filter on the correct one with table[id=phone_details]
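  • The leading table rows contain only formatting markup, so skip them with tr:gt(2). Note that :gt(n) matches elements whose zero-based sibling index is greater than n, so tr:gt(2) drops the first three rows; use tr:gt(1) if only two rows need to be skipped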
  • Every other row starts with a cell (carrying a rowspan attribute) that holds the general category for the rows it spans; filter it out with td:not([rowspan])
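
To try the three rules without hitting the live site, here is a minimal sketch that runs them against an inline HTML string. The markup, the class name SelectorRulesSketch and the placeholder values are made up for illustration and only imitate the layout described above; the real page may be structured differently. It assumes the jsoup jar is on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorRulesSketch {

    // Hypothetical markup imitating the layout described above; the real page may differ.
    static final String SAMPLE_HTML =
          "<table id=\"layout\"><tr><td>menu</td><td>banner</td></tr></table>"
        + "<table id=\"phone_details\">"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td rowspan=\"2\">Network</td><td>Network Type</td><td>value-1</td></tr>"
        + "<tr><td>Frequency</td><td>value-2</td></tr>"
        + "<tr><td rowspan=\"1\">Battery</td><td>Standby</td><td>value-3</td></tr>"
        + "</table>";

    // Returns label -> value pairs in document order.
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> specs = new LinkedHashMap<>();
        // Rule 1: only the table with the expected id.
        for (Element table : doc.select("table[id=phone_details]")) {
            // Rule 2: :gt(2) keeps rows whose zero-based sibling index is 3 or higher.
            for (Element row : table.select("tr:gt(2)")) {
                // Rule 3: drop the category cell, which is the one carrying rowspan.
                Elements tds = row.select("td:not([rowspan])");
                if (tds.size() >= 2) {
                    specs.put(tds.get(0).text(), tds.get(1).text());
                }
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        extract(SAMPLE_HTML).forEach((label, value) -> System.out.println(label + "->" + value));
    }
}
```

Collecting the pairs into a map instead of printing them directly makes it easier to check the rules against a saved copy of the page before pointing the code at the real URL.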

For more complex options in the selector syntax, see http://jsoup.org/cookbook/extracting-data/selector-syntax

XPath for the columns: //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong

XPath for the values: //*[@id="phone_details"]/tbody/tr[3]/td[3]

@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.

Replace the numbers (tr[N] / td[N]) with the appropriate values.
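
As a rough sketch of that translation: XPath positions such as tr[N] are 1-based and count only elements of that name, which matches the CSS :nth-of-type(N) pseudo-class that jsoup's select() accepts. The helper names cellSelector and cellText, the class name and the sample markup below are hypothetical; in this made-up sample the first data row happens to be tr[4], so adjust the numbers to the real page. jsoup's HTML parser adds a tbody around the rows when the markup omits it, which is why the selector can mirror the /tbody/ step.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class XPathToSelectSketch {

    // Hypothetical markup; the real page may differ.
    static final String SAMPLE_HTML =
          "<table id=\"phone_details\">"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td colspan=\"3\">formatting</td></tr>"
        + "<tr><td rowspan=\"1\">Network</td><td><strong>Network Type</strong></td><td>value-1</td></tr>"
        + "</table>";

    // CSS counterpart of //*[@id="tableId"]/tbody/tr[row]/td[col], with 1-based row and col.
    static String cellSelector(String tableId, int row, int col) {
        return "#" + tableId + " > tbody > tr:nth-of-type(" + row + ") > td:nth-of-type(" + col + ")";
    }

    // Returns the text of the addressed cell, or null when no cell matches.
    static String cellText(Document doc, String tableId, int row, int col) {
        Element cell = doc.select(cellSelector(tableId, row, col)).first();
        return cell == null ? null : cell.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        String label = cellText(doc, "phone_details", 4, 2);
        String value = cellText(doc, "phone_details", 4, 3);
        System.out.println(label + "->" + value);
    }
}
```

Looping row from the first data row up to the number of rows in the table gives every label/value pair; cells that do not exist simply come back as null.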

Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can split the text on a delimiter or read from a fixed character offset (after N chars) to extract the data.
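
Both text-based approaches can be sketched in plain Java. The example assumes the text dump has already been produced and that each line holds a label, some padding and a value; that layout, the class name TextDumpSketch and the sample values are assumptions for illustration, not the actual output of any particular browser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TextDumpSketch {

    // Delimiter approach: split each line at the first run of two or more whitespace characters.
    static Map<String, String> byDelimiter(String text) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String[] parts = line.trim().split("\\s{2,}", 2);
            if (parts.length == 2) {
                out.put(parts[0], parts[1]);
            }
        }
        return out;
    }

    // Offset approach: the label occupies the first `width` characters, the value is the rest.
    static Map<String, String> byOffset(String text, int width) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            if (line.length() > width) {
                out.put(line.substring(0, width).trim(), line.substring(width).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical two-line dump with the label padded to 20 characters.
        String text = String.format("%-20s%s%n%-20s%s%n", "Network Type", "value-1", "Standby", "value-2");
        System.out.println(byDelimiter(text));
        System.out.println(byOffset(text, 20));
    }
}
```

The delimiter variant copes with columns of varying width but breaks if a label itself contains a double space; the offset variant is the reverse.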

This is how I get the data from an HTML table.

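// Assumes doc is an already parsed org.jsoup.nodes.Document and "tableId" is the id of the target table
StringBuilder cadena = new StringBuilder();
org.jsoup.nodes.Element tablaRegistros = doc.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        // Alternative: take only the first two cells of each row
        // Elements tds = row.select("td");
        // cadena.append(tds.get(0).text()).append("->").append(tds.get(1).text()).append(" \n");
        cadena.append(column.text()).append(",");
    }
    // One output line per table row
    cadena.append("\n");
}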

Here is a generic solution for extracting a table from an HTML page with Jsoup.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractTableDataUsingJSoup {

    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
    }

    public static void extractTableUsingJsoup(String url, String tableId){
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect(url).get();

            // Pass the id of any table on the page; the code below prints the text of each of its cells.
            // Store the extracted data in a suitable data structure if it needs further processing.
            Element table = doc.getElementById(tableId);
            if (table == null) {
                // getElementById returns null when no element has this id
                System.err.println("No element with id " + tableId);
                return;
            }

            Elements tds = table.getElementsByTag("td");

            // getElementsByTag also returns the cells of nested tables, if any exist
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
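
If the cells need to be kept rather than printed, one possible shape is a list of rows, each row being a list of cell texts. This is only a sketch: the class name TableRowsSketch, the method name rows and the sample markup are hypothetical, and it looks at the direct td/th children of each row so that a cell is not counted twice when tables are nested.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableRowsSketch {

    // Hypothetical markup; the real page may differ.
    static final String SAMPLE_HTML =
          "<table id=\"phone_details\">"
        + "<tr><th>Group</th><th>Item</th><th>Value</th></tr>"
        + "<tr><td>Network</td><td>Network Type</td><td>value-1</td></tr>"
        + "<tr><td>Battery</td><td>Standby</td><td>value-2</td></tr>"
        + "</table>";

    // Returns one list of cell texts per table row; empty when the id is not found.
    static List<List<String>> rows(Document doc, String tableId) {
        List<List<String>> result = new ArrayList<>();
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return result;
        }
        for (Element tr : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // Only direct children, so cells of a nested table stay with their own row.
            for (Element cell : tr.children()) {
                String tag = cell.tagName();
                if (tag.equals("td") || tag.equals("th")) {
                    cells.add(cell.text());
                }
            }
            if (!cells.isEmpty()) {
                result.add(cells);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        for (List<String> row : rows(doc, "phone_details")) {
            System.out.println(String.join(",", row));
        }
    }
}
```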
