Extract values from HTML tags using Java with jsoup

I'm new to the jsoup library (jsoup-1.14.3).

I have this HTML:

 <html><head><title>Alfresco Content Repository</title><style>body { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; } table { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; } .listingTable { border: solid black 1px; } .textCommand { font-family: verdana; font-size: 10pt; } .textLocation { font-family: verdana; font-size: 11pt; font-weight: bold; color: #2a568f; } .textData { font-family: verdana; font-size: 10pt; } .tableHeading { font-family: verdana; font-size: 10pt; font-weight: bold; color: white; background-color: #2a568f; } .rowOdd { background-color: #eeeeee; } .rowEven { background-color: #dddddd; } </style></head> <body> <table cellspacing='2' cellpadding='3' border='0' width='100%'> <tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr> <tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'> <tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr> <tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/ED">ED</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr> <tr class='rowEven'><td class='textData'><a href="/alfresco/webdav/rep/FLOW%20CHART">FLOW CHART</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Thu, 27 Jun 2013 13:30:18 GMT</td></tr> <tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/file">file</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Wed, 10 Nov 2021 13:16:49 GMT</td></tr> </table></body></html>

I'm trying to get the href of each a tag.

For example, from this fragment:

 <table cellspacing='2' cellpadding='3' border='0' width='100%'> <tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr> <tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'> <tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr> <tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/ED">ED</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>

I want to extract "/alfresco/webdav/rep/ED", "ED", and "Thu, 05 Jan 2017 11:11:14 GMT".

First, you need to parse the HTML, which is a String, into a Document.

final Document document = Jsoup.parse(html);

Then select all tr tags that contain an a tag.

final Elements trElements = document.select("tr:has(a)");

Next, iterate over each tr tag found:

for (final Element trElement : trElements) {
    //Do stuff
}

For each tr tag, you want the href value of its a tag. First, retrieve the a tag itself:

final Element aElement = trElement.select("a").first();

Then retrieve the value of the href attribute of the a tag:

final String href = aElement.attr("href");
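The href values in this listing are site-relative paths. If an absolute URL is needed, one option is to give jsoup a base URI when parsing and then ask for the resolved attribute. The following is a minimal sketch under assumptions: the class name, the helper firstAbsoluteHref, and the base URI https://example.com/ are all hypothetical and do not come from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
class AbsoluteHrefSketch {

    // Parses the HTML against a base URI and returns the first link's
    // href resolved to an absolute URL, or "" if there is no link.
    static String firstAbsoluteHref(final String html, final String baseUri) {
        final Document document = Jsoup.parse(html, baseUri);
        final Element aElement = document.select("a[href]").first();
        // absUrl() resolves the attribute value against the document's base URI.
        return aElement == null ? "" : aElement.absUrl("href");
    }

    public static void main(final String[] args) {
        final String html =
                "<table><tr><td><a href=\"/alfresco/webdav/rep/ED\">ED</a></td></tr></table>";
        // The base URI below is an assumption; use the real server address instead.
        System.out.println(firstAbsoluteHref(html, "https://example.com/"));
    }
}
```

Plain attr("href") keeps returning the value exactly as written in the markup, so the two calls can be used side by side.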

For the name, retrieve the text content of the a tag:

final String name = aElement.text();

For the date, retrieve the fourth td tag of the tr tag (index 3, since indices start at 0):

final Element dateTdElement = trElement.select("td").get(3);

Then read its text to get the date:

final String date = dateTdElement.text();
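If the date is needed as a value rather than a String, the text in this listing appears to follow the RFC 1123 format, which the JDK can parse directly. A minimal sketch, assuming every date cell uses that format (the class and method names are hypothetical):

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper class, for illustration only.
class ListingDateSketch {

    // Parses text such as "Thu, 05 Jan 2017 11:11:14 GMT".
    // Throws DateTimeParseException if the text is in another format.
    static ZonedDateTime parseListingDate(final String text) {
        return ZonedDateTime.parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(final String[] args) {
        final ZonedDateTime modified = parseListingDate("Thu, 05 Jan 2017 11:11:14 GMT");
        System.out.println(modified.toInstant()); // prints 2017-01-05T11:11:14Z
    }
}
```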

NB: The select() method accepts a CSS query. Most standard CSS selectors are supported, along with extensions such as ':has()'. See the jsoup documentation for more detail.
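Because select() takes a CSS query, the row filter and the cell lookup can also be folded into a single selector. The sketch below is one possible variant, not the method used in this answer: it limits the search to the table with class listingTable and uses jsoup's ':eq(3)' pseudo-selector (sibling index 3, the fourth cell) in place of get(3). The class and method names are hypothetical.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
class SelectorSketch {

    // Returns the text of the fourth cell of every listing row that contains a link.
    static List<String> modifiedDates(final String html) {
        final Document document = Jsoup.parse(html);
        // table.listingTable : only the listing table, not the heading table
        // tr:has(a)          : only rows that contain a link (skips the header row)
        // > td:eq(3)         : the direct child cell at sibling index 3
        return document.select("table.listingTable tr:has(a) > td:eq(3)").eachText();
    }

    public static void main(final String[] args) {
        final String html = "<table class='listingTable'>"
                + "<tr><td>Name</td><td>Size</td><td>Type</td><td>Modified Date</td></tr>"
                + "<tr><td><a href=\"/alfresco/webdav/rep/ED\">ED</a></td><td>&nbsp;</td>"
                + "<td>&nbsp;</td><td>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>"
                + "</table>";
        System.out.println(modifiedDates(html));
    }
}
```

Rows with fewer than four cells simply produce no match here, whereas get(3) would throw an IndexOutOfBoundsException.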

To sum it all up in one program:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ListingExtractor {
    public static void main(final String[] args) {
        final String html = "<html><head><title>Alfresco Content Repository</title><style>body { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }\n" +
                "table { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }\n" +
                ".listingTable { border: solid black 1px; }\n" +
                ".textCommand { font-family: verdana; font-size: 10pt; }\n" +
                ".textLocation { font-family: verdana; font-size: 11pt; font-weight: bold; color: #2a568f; }\n" +
                ".textData { font-family: verdana; font-size: 10pt; }\n" +
                ".tableHeading { font-family: verdana; font-size: 10pt; font-weight: bold; color: white; background-color: #2a568f; }\n" +
                ".rowOdd { background-color: #eeeeee; }\n" +
                ".rowEven { background-color: #dddddd; }\n" +
                "</style></head>\n" +
                "<body>\n" +
                "<table cellspacing='2' cellpadding='3' border='0' width='100%'>\n" +
                "<tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr>\n" +
                "<tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'>\n" +
                "<tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr>\n" +
                "<tr class='rowOdd'><td class='textData'><a href=\"/alfresco/webdav/rep/ED\">ED</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>\n" +
                "<tr class='rowEven'><td class='textData'><a href=\"/alfresco/webdav/rep/FLOW%20CHART\">FLOW CHART</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Thu, 27 Jun 2013 13:30:18 GMT</td></tr>\n" +
                "<tr class='rowOdd'><td class='textData'><a href=\"/alfresco/webdav/rep/file\">file</a></td><td class='textData'>&nbsp;</td><td class='textData'>&nbsp;</td><td class='textData'>Wed, 10 Nov 2021 13:16:49 GMT</td></tr>\n" +
                "\n" +
                "\n" +
                "</table></body></html>";

        // Parse the HTML string into a DOM.
        final Document document = Jsoup.parse(html);
        // Keep only the rows that contain a link; this skips the header rows.
        final Elements trElements = document.select("tr:has(a)");
        for (final Element trElement : trElements) {
            // The link carries both the href and the display name.
            final Element aElement = trElement.select("a").first();
            final String href = aElement.attr("href");
            System.out.println("Href : " + href);

            final String name = aElement.text();
            System.out.println("Name : " + name);

            // The fourth cell (index 3) holds the modification date.
            final Element dateTdElement = trElement.select("td").get(3);
            final String date = dateTdElement.text();
            System.out.println("Date : " + date);
        }
    }
}

It prints:

Href : /alfresco/webdav/rep/ED
Name : ED
Date : Thu, 05 Jan 2017 11:11:14 GMT
Href : /alfresco/webdav/rep/FLOW%20CHART
Name : FLOW CHART
Date : Thu, 27 Jun 2013 13:30:18 GMT
Href : /alfresco/webdav/rep/file
Name : file
Date : Wed, 10 Nov 2021 13:16:49 GMT
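Note that the second href in the output is percent-encoded (FLOW%20CHART), because attr("href") returns the attribute exactly as written in the markup. If the decoded path is wanted, one way is java.net.URI from the JDK. A minimal sketch, assuming the href is a syntactically valid URI reference (the class and method names are hypothetical):

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
class HrefDecodeSketch {

    // Returns the path component with percent-escapes decoded.
    // URI.create throws IllegalArgumentException for a malformed href.
    static String decodedPath(final String href) {
        return URI.create(href).getPath();
    }

    public static void main(final String[] args) {
        // prints /alfresco/webdav/rep/FLOW CHART
        System.out.println(decodedPath("/alfresco/webdav/rep/FLOW%20CHART"));
    }
}
```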
