Parsing element using jsoup in Java

I am using jsoup in my Java app to parse HTML, and now I need to parse table data. For each <tr>, I want to read the value of the first <td>. If that value contains the word "Outdated", the row should be skipped; otherwise I want to go to the third <td> and get the values containing ".rpm". I cannot get it to work. I have tried many approaches without success, so I am hoping someone here has experience with this.

public class rpms {

    public static void getTdSibling(String sourceTd) throws FileNotFoundException, UnsupportedEncodingException {
        String fragment = sourceTd;
        Document doc = Jsoup.parseBodyFragment(fragment);
        Elements myElements = doc.getElementsByClass("confluenceTable tablesorter").first().getElementsByTag("tr");
        for (Element element : myElements) {
            if (element.select("td").contains("Outdated")) {
                String rpms = element.ownText();
                System.out.println(rpms);
            }
        }
    }

    public static void main(String[] args) {
        URLget rpms = new URLget();
        try {
            getTdSibling(sendGetRequest(URL).toString());

        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Below is the HTML of the table being parsed:

<table class="confluenceTable tablesorter">
    <tbody class="">
        <tr>
            <td colspan="1" class="confluenceTd">RHSA-2014:1172</td>
            <td colspan="1" class="confluenceTd">
                <p>The procmail program is used for local mail delivery. In addition to just
                    <br>delivering mail, procmail can be used for automatic filtering, presorting,
                    <br>and other mail handling jobs.</p>
                <p>A heap-based buffer overflow flaw was found in procmail's formail utility.
                    <br>A remote attacker could send an email with specially crafted headers that,
                    <br>when processed by formail, could cause procmail to crash or, possibly,
                    <br>execute arbitrary code as the user running formail. (CVE-2014-3618)
                </p>
            </td>
            <td colspan="1" class="confluenceTd">procmail-3.22-17.1.2.x86_64.rpm</td>
            <td colspan="1" class="confluenceTd">
                <img class="emoticon emoticon-tick" src="/s/en_GB-1988229788/4733/f235dd088df5682b0560ab6fc66ed22c9124c0be.57/_/images/icons/emoticons/check.png" data-emoticon-name="tick" alt="(tick)">
            </td>
        </tr>

        <tr>
            <td colspan="1" class="confluenceTd">Outdated RHSA-2014:1166</td>
            <td colspan="1" class="confluenceTd">
                <p>Jakarta Commons HTTPClient implements the client side of HTTP standards.</p>
                <p>It was discovered that the HTTPClient incorrectly extracted host name from
                    <br>an X.509 certificate subject's Common Name (CN) field. A man-in-the-middle
                    <br>attacker could use this flaw to spoof an SSL server using a specially
                    <br>crafted X.509 certificate. (CVE-2014-3577)</p>
            </td>
            <td colspan="1" class="confluenceTd">
                <p>jakarta-commons-httpclient-3.0-7jpp.4.el5_10.x86_64.rpm</p>
                <p>jakarta-commons-httpclient-demo-3.0-7jpp.4.el5_10.x86_64.rpm</p>
                <p>jakarta-commons-httpclient-javadoc-3.0-7jpp.4.el5_10.x86_64.rpm</p>
                <p>jakarta-commons-httpclient-manual-3.0-7jpp.4.el5_10.x86_64.rpm</p>
            </td>
        </tr>

        <tr>
            <td colspan="1" class="confluenceTd">RHSA-2014:1148-1</td>
            <td colspan="1" class="confluenceTd">
                <p>A flaw was found in the way Squid handled malformed HTTP Range headers.
                    <br>A remote attacker able to send HTTP requests to the Squid proxy could use
                    <br>this flaw to crash Squid. (CVE-2014-3609)
                </p>
                <p>A buffer overflow flaw was found in Squid's DNS lookup module. A remote
                    <br>attacker able to send HTTP requests to the Squid proxy could use this flaw
                    <br>to crash Squid. (CVE-2013-4115)</p>
            </td>
            <td colspan="1" class="confluenceTd"><span>squid-2.6.STABLE21-7.el5_10.x86_64.rpm</span>
            </td>
            <td colspan="1" class="confluenceTd"></td>
        </tr>
</table>

I need your help. I have tried many times and read related posts here, but I still can't get it to work. Thank you.

Be careful with your element accessors (see the jsoup documentation):

You can only pass one class name to getElementsByClass.
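If the table should be matched on both of its classes at once, one option is a CSS selector through select, where chained class names (.a.b) require an element to carry all of them. The following is a minimal sketch, not part of the original answer; the class name ClassSelectorSketch, the helper rowsOf, and the inline sample HTML are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ClassSelectorSketch {

    // Hypothetical helper: returns the <tr> elements inside every table
    // that carries both the "confluenceTable" and "tablesorter" classes.
    static Elements rowsOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("table.confluenceTable.tablesorter tr");
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table from the question
        String html = "<table class=\"confluenceTable tablesorter\"><tbody>"
                + "<tr><td>RHSA-2014:1172</td></tr>"
                + "<tr><td>Outdated RHSA-2014:1166</td></tr>"
                + "</tbody></table>";
        System.out.println(rowsOf(html).size());
    }
}
```

Sticking with getElementsByClass, a single class name is enough to find this table: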

public static void getTdSibling(String sourceTd) throws FileNotFoundException, UnsupportedEncodingException {
    String fragment = sourceTd;
    Document doc = Jsoup.parseBodyFragment(fragment);
    // getElementsByClass takes a single class name
    Elements myElements = doc.getElementsByClass("confluenceTable").first().getElementsByTag("tr");
    for (Element element : myElements) {
        // select the TDs; skip rows that have no 3rd cell (e.g. header rows)
        Elements tds = element.getElementsByTag("td");
        if (tds.size() < 3) {
            continue;
        }
        // put your condition here (negate it to skip "Outdated" rows instead)
        if (tds.first().text().contains("Outdated")) {
            // access the child elements (<p>, <span>) of the 3rd td;
            // text placed directly in the td has no child element, use tds.get(2).text() for that
            Elements rpms = tds.get(2).children();
            for (Element rpm : rpms) {
                if (rpm.text().contains(".rpm")) {
                    System.out.println(rpm.text());
                }
            }
        }
    }
}

Edit: the code now accesses the 3rd td in each row.
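For the behaviour described in the question (skip rows whose first cell contains "Outdated", otherwise read the ".rpm" names from the third cell), a self-contained sketch could look like the one below. It is an illustration under assumptions, not the answerer's code: the class RpmExtractorSketch and the helper currentRpms are hypothetical names, the sample HTML is a shortened copy of the table from the question, and package names are assumed to contain no whitespace. It reads the cell through text(), so it does not matter whether the names sit directly in the <td> or inside <p> or <span> elements. Note also that Elements.contains("Outdated"), as used in the question, tests whether the list holds that object rather than searching the cell text, which would explain why that condition never matched.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RpmExtractorSketch {

    // Hypothetical helper: collects the .rpm names from the 3rd cell of
    // every row whose 1st cell does not contain "Outdated".
    static List<String> currentRpms(String html) {
        List<String> rpms = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        for (Element row : doc.select("table.confluenceTable tr")) {
            Elements tds = row.select("td");
            if (tds.size() < 3) {
                continue; // header rows or short rows have no 3rd cell
            }
            if (tds.get(0).text().contains("Outdated")) {
                continue; // skip outdated advisories
            }
            // text() returns the combined, whitespace-normalized text of the
            // cell and its descendants, so <p>/<span> wrappers do not matter
            for (String token : tds.get(2).text().split("\\s+")) {
                if (token.endsWith(".rpm")) {
                    rpms.add(token);
                }
            }
        }
        return rpms;
    }

    public static void main(String[] args) {
        String html = "<table class=\"confluenceTable tablesorter\"><tbody>"
                + "<tr><td>RHSA-2014:1172</td><td><p>procmail text</p></td>"
                + "<td>procmail-3.22-17.1.2.x86_64.rpm</td><td></td></tr>"
                + "<tr><td>Outdated RHSA-2014:1166</td><td><p>httpclient text</p></td>"
                + "<td><p>jakarta-commons-httpclient-3.0-7jpp.4.el5_10.x86_64.rpm</p></td></tr>"
                + "<tr><td>RHSA-2014:1148-1</td><td><p>squid text</p></td>"
                + "<td><span>squid-2.6.STABLE21-7.el5_10.x86_64.rpm</span></td><td></td></tr>"
                + "</tbody></table>";
        for (String rpm : currentRpms(html)) {
            System.out.println(rpm);
        }
    }
}
```

Returning a list instead of printing inside the loop keeps the parsing step separate from whatever is done with the package names afterwards.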
