Extract values from HTML tags using Java with jsoup
I'm new to the jsoup library (jsoup-1.14.3).
I have this HTML:
<html><head><title>Alfresco Content Repository</title><style>body { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; } table { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; } .listingTable { border: solid black 1px; } .textCommand { font-family: verdana; font-size: 10pt; } .textLocation { font-family: verdana; font-size: 11pt; font-weight: bold; color: #2a568f; } .textData { font-family: verdana; font-size: 10pt; } .tableHeading { font-family: verdana; font-size: 10pt; font-weight: bold; color: white; background-color: #2a568f; } .rowOdd { background-color: #eeeeee; } .rowEven { background-color: #dddddd; } </style></head> <body> <table cellspacing='2' cellpadding='3' border='0' width='100%'> <tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr> <tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'> <tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr> <tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/ED">ED</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr> <tr class='rowEven'><td class='textData'><a href="/alfresco/webdav/rep/FLOW%20CHART">FLOW CHART</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 27 Jun 2013 13:30:18 GMT</td></tr> <tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/file">file</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Wed, 10 Nov 2021 13:16:49 GMT</td></tr> </table></body></html>
I'm trying to get the href of each a tag.
For example, from this fragment:
<table cellspacing='2' cellpadding='3' border='0' width='100%'> <tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr> <tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'> <tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr> <tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/ED">ED</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>
I want to extract "/alfresco/webdav/rep/ED", "ED" and "Thu, 05 Jan 2017 11:11:14 GMT".
First, you need to parse the HTML, which is a String, into a Document.
final Document document = Jsoup.parse(html);
Then you need to select every tr tag that contains an a tag.
final Elements trElements = document.select("tr:has(a)");
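To see what that selector keeps and what it drops, here is a minimal sketch on a cut-down table. The class name SelectRowsSketch, the helper countRowsWithLinks and the shortened HTML are made up for illustration; only Jsoup.parse and select come from the answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectRowsSketch {
    // Hypothetical helper: counts the <tr> rows that contain at least one <a>.
    static int countRowsWithLinks(final String html) {
        final Document document = Jsoup.parse(html);
        final Elements rows = document.select("tr:has(a)");
        return rows.size();
    }

    public static void main(final String[] args) {
        // Shortened stand-in for the listing: one heading row, one data row.
        final String html = "<table>"
                + "<tr><td>Name</td><td>Modified Date</td></tr>"
                + "<tr><td><a href=\"/alfresco/webdav/rep/ED\">ED</a></td>"
                + "<td>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>"
                + "</table>";
        // The heading row has no link, so only the data row is counted.
        System.out.println(countRowsWithLinks(html)); // prints 1
    }
}
```

The heading row and the "Directory listing" row are skipped because neither contains a link, which is why the loop in the next step never has to deal with them.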
Next, you need to iterate over each tr tag found:
for (final Element trElement : trElements) {
//Do stuff
}
For each tr tag, you retrieve the href value of its a tag. But first, you need to retrieve the a tag itself:
final Element aElement = trElement.select("a").first();
Then you retrieve the value of the href attribute of the a tag.
final String href = aElement.attr("href");
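The value returned here is the server-relative path exactly as written in the page. If an absolute URL is needed instead, one option is to pass a base URI when parsing and call absUrl. This is only a sketch: the base URI http://localhost:8080/ is an assumption (the question does not say where Alfresco is hosted), and the class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteHrefSketch {
    // Hypothetical helper: resolves the first link's href against baseUri.
    // Returns an empty string when the HTML contains no link.
    static String firstAbsoluteHref(final String html, final String baseUri) {
        final Document document = Jsoup.parse(html, baseUri);
        final Element aElement = document.selectFirst("a[href]");
        return aElement == null ? "" : aElement.absUrl("href");
    }

    public static void main(final String[] args) {
        final String html = "<a href=\"/alfresco/webdav/rep/ED\">ED</a>";
        // Assumed host, for illustration only.
        System.out.println(firstAbsoluteHref(html, "http://localhost:8080/"));
        // prints http://localhost:8080/alfresco/webdav/rep/ED
    }
}
```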
For the name, you retrieve the text content of the a tag:
final String name = aElement.text();
For the date, you need to retrieve the fourth td tag of the tr tag (index 3, since indexes start at 0):
final Element dateTdElement = trElement.select("td").get(3);
Then just retrieve its text to get the date:
final String date = dateTdElement.text();
NB: The select() method accepts a CSS query. Standard CSS selectors are supported, along with extended syntax such as ':has()'. See the jsoup documentation for more detail.
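As an illustration of how far a single CSS query can go, the sketch below pulls the hrefs and the dates without an explicit loop. It is an alternative to the step-by-step approach, not a replacement for it; the class and method names are hypothetical, and it assumes the date is always in the fourth cell of a row that contains a link.

```java
import java.util.List;

import org.jsoup.Jsoup;

public class SelectorVariantsSketch {
    // Hypothetical helper: href of every link inside the listing table.
    static List<String> linkTargets(final String html) {
        return Jsoup.parse(html)
                .select("table.listingTable a[href]")
                .eachAttr("href");
    }

    // Hypothetical helper: text of the fourth cell of every row that has a link.
    static List<String> dates(final String html) {
        return Jsoup.parse(html)
                .select("table.listingTable tr:has(a) td:nth-child(4)")
                .eachText();
    }

    public static void main(final String[] args) {
        // Shortened stand-in for the listing table from the question.
        final String html = "<table class='listingTable'>"
                + "<tr><td>Name</td><td>Size</td><td>Type</td><td>Modified Date</td></tr>"
                + "<tr><td><a href=\"/alfresco/webdav/rep/ED\">ED</a></td><td> </td><td> </td>"
                + "<td>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>"
                + "</table>";
        System.out.println(linkTargets(html)); // prints [/alfresco/webdav/rep/ED]
        System.out.println(dates(html));       // prints [Thu, 05 Jan 2017 11:11:14 GMT]
    }
}
```

The loop-based version keeps the href, name and date of one row together, which is usually easier to work with; the two lists here are only lined up by position.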
To sum it all up in one piece of code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(final String[] args) {
        final String html = "<html><head><title>Alfresco Content Repository</title><style>body { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }\n" +
                "table { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }\n" +
                ".listingTable { border: solid black 1px; }\n" +
                ".textCommand { font-family: verdana; font-size: 10pt; }\n" +
                ".textLocation { font-family: verdana; font-size: 11pt; font-weight: bold; color: #2a568f; }\n" +
                ".textData { font-family: verdana; font-size: 10pt; }\n" +
                ".tableHeading { font-family: verdana; font-size: 10pt; font-weight: bold; color: white; background-color: #2a568f; }\n" +
                ".rowOdd { background-color: #eeeeee; }\n" +
                ".rowEven { background-color: #dddddd; }\n" +
                "</style></head>\n" +
                "<body>\n" +
                "<table cellspacing='2' cellpadding='3' border='0' width='100%'>\n" +
                "<tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr>\n" +
                "<tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'>\n" +
                "<tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr>\n" +
                "<tr class='rowOdd'><td class='textData'><a href=\"/alfresco/webdav/rep/ED\">ED</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>\n" +
                "<tr class='rowEven'><td class='textData'><a href=\"/alfresco/webdav/rep/FLOW%20CHART\">FLOW CHART</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 27 Jun 2013 13:30:18 GMT</td></tr>\n" +
                "<tr class='rowOdd'><td class='textData'><a href=\"/alfresco/webdav/rep/file\">file</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Wed, 10 Nov 2021 13:16:49 GMT</td></tr>\n" +
                "\n" +
                "\n" +
                "</table></body></html>";

        // Parse the HTML string into a Document.
        final Document document = Jsoup.parse(html);
        // Keep only the rows that contain a link; this skips the heading rows.
        final Elements trElements = document.select("tr:has(a)");
        for (final Element trElement : trElements) {
            // The link holds both the href and the displayed name.
            final Element aElement = trElement.select("a").first();
            final String href = aElement.attr("href");
            System.out.println("Href : " + href);
            final String name = aElement.text();
            System.out.println("Name : " + name);
            // The modified date is in the fourth cell (index 3) of the row.
            final Element dateTdElement = trElement.select("td").get(3);
            final String date = dateTdElement.text();
            System.out.println("Date : " + date);
        }
    }
}
It prints something like:
Href : /alfresco/webdav/rep/ED
Name : ED
Date : Thu, 05 Jan 2017 11:11:14 GMT
Href : /alfresco/webdav/rep/FLOW%20CHART
Name : FLOW CHART
Date : Thu, 27 Jun 2013 13:30:18 GMT
Href : /alfresco/webdav/rep/file
Name : file
Date : Wed, 10 Nov 2021 13:16:49 GMT
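The date comes back as plain text. If it needs to be compared or sorted, it can be converted with the standard java.time API, since the listing appears to use the RFC 1123 format. A minimal sketch, assuming every row follows that format; the class and method names are hypothetical:

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ListingDateSketch {
    // Hypothetical helper: converts a date cell such as
    // "Thu, 05 Jan 2017 11:11:14 GMT" into an Instant.
    // Throws DateTimeParseException if the text is not in RFC 1123 format.
    static Instant parseListingDate(final String text) {
        return ZonedDateTime
                .parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
    }

    public static void main(final String[] args) {
        System.out.println(parseListingDate("Thu, 05 Jan 2017 11:11:14 GMT"));
        // prints 2017-01-05T11:11:14Z
    }
}
```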