Extract and Parse HTML Table using Jsoup
How can I use Jsoup to extract the specification data from this website separately for each row, e.g. Network -> Network Type, Battery, etc.?
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class mobilereviews {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                System.out.println(tds.get(0).text());
            }
        }
    }
}
Here is an attempt at a solution to your problem:
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        Elements tds = row.select("td:not([rowspan])");
        // Skip rows with fewer than two matching cells to avoid an IndexOutOfBoundsException.
        if (tds.size() >= 2) {
            System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
        }
    }
}
Parsing HTML is tricky, and if the HTML changes, your code needs to change as well.
You need to study the HTML markup first to come up with your parsing rules.
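table[id=phone_details] selects the table whose id attribute is phone_details.
tr:gt(2) selects rows whose sibling index is greater than 2 (the index starts at 0), so the first three rows are skipped.
td:not([rowspan]) selects cells that have no rowspan attribute.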
For more complex options in the selector syntax, look here: http://jsoup.org/cookbook/extracting-data/selector-syntax
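To see the three selectors working together without hitting the network, here is a minimal sketch that runs them against an inline HTML string. The markup is hypothetical and only imitates what I assume the page looks like (three leading rows to skip, then rows whose first cell carries a rowspan with the category name); the class and method names are mine. It also remembers the last rowspan cell so each line is prefixed with its category.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorDemo {
    // Hypothetical markup, NOT copied from the real page.
    static final String HTML =
        "<table id=\"phone_details\"><tbody>"
        + "<tr><th colspan=\"3\">Motorola L7</th></tr>"
        + "<tr><td colspan=\"3\">Photo</td></tr>"
        + "<tr><td colspan=\"3\">Summary</td></tr>"
        + "<tr><td rowspan=\"2\">Network</td><td>Network Type</td><td>GSM</td></tr>"
        + "<tr><td>Announced</td><td>2005</td></tr>"
        + "<tr><td rowspan=\"1\">Battery</td><td>Stand-by</td><td>350 h</td></tr>"
        + "</tbody></table>";

    // Returns one "category->name->value" string per data row.
    static List<String> extractPairs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> pairs = new ArrayList<>();
        String category = "";
        for (Element row : doc.select("table[id=phone_details] tr:gt(2)")) {
            // A cell with a rowspan starts a new category (assumption about the layout).
            Element categoryCell = row.select("td[rowspan]").first();
            if (categoryCell != null) {
                category = categoryCell.text();
            }
            Elements tds = row.select("td:not([rowspan])");
            if (tds.size() >= 2) {
                pairs.add(category + "->" + tds.get(0).text() + "->" + tds.get(1).text());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        extractPairs(HTML).forEach(System.out::println);
    }
}
```

If the real markup differs from this assumed layout, only the selector strings and the index in tr:gt(...) should need adjusting.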
XPath for the columns: //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong
XPath for the values: //*[@id="phone_details"]/tbody/tr[3]/td[3]
@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.
Replace the numbers (tr[N] / td[N]) with appropriate values.
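As a rough sketch of that translation: XPath positions such as tr[3] are 1-based and count only siblings with the same tag, which is also how the CSS :nth-of-type(n) pseudo-class counts, so the two XPaths above can be rewritten as select() queries fairly directly. The HTML string below is a made-up stand-in for the real page, and the helper names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class XPathToSelect {
    // Made-up markup with the data of interest in the third row.
    static final String HTML =
        "<table id=\"phone_details\"><tbody>"
        + "<tr><td>a</td><td>b</td><td>c</td></tr>"
        + "<tr><td>d</td><td>e</td><td>f</td></tr>"
        + "<tr><td>General</td><td><strong>Network Type</strong></td><td>GSM</td></tr>"
        + "</tbody></table>";

    // CSS equivalent of //*[@id="phone_details"]/tbody/tr[row]/td[col]
    static String cellQuery(int row, int col) {
        return "#phone_details > tbody > tr:nth-of-type(" + row + ")"
             + " > td:nth-of-type(" + col + ")";
    }

    // Returns the text of the first match, or null when nothing matches.
    static String textOf(Document doc, String cssQuery) {
        Element match = doc.select(cssQuery).first();
        return match == null ? null : match.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        String label = textOf(doc, cellQuery(3, 2) + " > strong"); // .../tr[3]/td[2]/strong
        String value = textOf(doc, cellQuery(3, 3));               // .../tr[3]/td[3]
        System.out.println(label + "->" + value);
    }
}
```

Looping the row argument over the rows you care about gives you every label/value pair.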
Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can delimit the text or read after N chars to extract the data.
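A small sketch of both ideas, using only the standard library. I am assuming the text dump puts one specification per line, with the label and the value separated by two or more spaces; the real output of a text-only browser may look different, and the class and method names are just for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TextDumpParser {
    // "Delimit the text": split each line at the first run of two or more
    // whitespace characters. Lines without such a gap are ignored.
    static Map<String, String> parseDelimited(String text) {
        Map<String, String> specs = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String[] parts = line.trim().split("\\s{2,}", 2);
            if (parts.length == 2) {
                specs.put(parts[0], parts[1]);
            }
        }
        return specs;
    }

    // "Read after N chars": treat the first n characters as the label column
    // and return whatever follows, trimmed.
    static String valueAfter(String line, int n) {
        return line.length() > n ? line.substring(n).trim() : "";
    }

    public static void main(String[] args) {
        String dump = "Network Type   GSM 850 / 900\nStand-by       350 h\n\nNotes";
        parseDelimited(dump).forEach((k, v) -> System.out.println(k + "->" + v));
        System.out.println(valueAfter("Network Type   GSM 850 / 900", 15));
    }
}
```

The delimiter version is more forgiving when column widths vary; the fixed-offset version only works if the dump is strictly column-aligned.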
This is how I get the data from an HTML table.
// Assumes doc is an already parsed org.jsoup.nodes.Document.
String cadena = "";
org.jsoup.nodes.Element tablaRegistros = doc.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        // Alternative: read only the first two cells of each row.
        // Elements tds = row.select("td");
        // cadena += tds.get(0).text() + "->" + tds.get(1).text() + " \n";
        cadena += column.text() + ",";
    }
    // One output line per table row.
    cadena += "\n";
}
Here is a generic solution for extracting a table from an HTML page with Jsoup.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ExtractTableDataUsingJSoup {
    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm", "phone_details");
    }

    public static void extractTableUsingJsoup(String url, String tableId) {
        try {
            // The URL must include the protocol (http:// or https://).
            Document doc = Jsoup.connect(url).get();
            // Pass the id of any table on any page; the code below prints that table's cell contents.
            // Store the extracted data in appropriate data structures for further processing.
            Element table = doc.getElementById(tableId);
            if (table == null) {
                // getElementById returns null when no element has this id.
                System.err.println("No element with id: " + tableId);
                return;
            }
            Elements tds = table.getElementsByTag("td");
            // If the table contains nested tables, their cells are included here as well.
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
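For the "appropriate data structures" part, one possible shape is a list of rows, each row being a list of cell texts. This is only a sketch: the class and method names are hypothetical, it parses an inline string instead of fetching the page, and it does not treat nested tables specially.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableToRows {
    // Returns the table as rows of cell texts; an empty list if the id is not found.
    static List<List<String>> toRows(Document doc, String tableId) {
        List<List<String>> rows = new ArrayList<>();
        Element table = doc.getElementById(tableId);
        if (table == null) {
            return rows;
        }
        for (Element tr : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // Collect header and data cells in document order.
            for (Element cell : tr.select("th, td")) {
                cells.add(cell.text());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table id=\"t\">"
            + "<tr><th>Name</th><th>Value</th></tr>"
            + "<tr><td>Weight</td><td>96 g</td></tr>"
            + "</table>";
        for (List<String> row : toRows(Jsoup.parse(html), "t")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

To use it on a live page, pass Jsoup.connect(url).get() as the document instead of the parsed string.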