How to select two (or more) HTML elements that exist at the same tree level with Jsoup?

I'm working on a project and have run into a problem. I need to scrape data from a website that contains the following HTML:

<div class="lin-curso" style="border: 0;">
    <div class="lin-area-c3">
        Vagas 2017
    </div>
</div>
<div class="box10">
    <div class="lin-area-c1">
        L160
    </div>
    <div class="lin-area-c2">
        Acupuntura
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        3155
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=L160&amp;code=3155" title="3155/L160">Instituto Politécnico de Setúbal - Escola Superior de Saúde</a>
    </div>
    <div class="lin-curso-c4">
        20
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        9059
    </div>
    <div class="lin-area-c2">
        Administração e Gestão de Empresas
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        2270
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=9059&amp;code=2270" title="2270/9059">Universidade Católica Portuguesa - Faculdade de Ciências Económicas e Empresariais</a>
    </div>
    <div class="lin-curso-c4">
        n.d.
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        8056
    </div>
    <div class="lin-area-c2">
        Administração e Gestão Pública
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        4275
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=8056&amp;code=4275" title="4275/8056">Instituto Superior de Ciências da Administração</a>
    </div>
    <div class="lin-curso-c4">
        20
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        8194
    </div>
    <div class="lin-area-c2">
        Administração da Guarda Nacional Republicana
    </div>
    <div class="lin-area-c3">
        [Mest Integ]
    </div>
</div>
<div class="lin-curso">
    <div class="lin-curso-c1">
        &nbsp;
    </div>
    <div class="lin-curso-c2">
        7510
    </div>
    <div class="lin-curso-c3">
        <a href="detcursopi.asp?codc=8194&amp;code=7510" title="7510/8194">Academia Militar</a>
    </div>
    <div class="lin-curso-c4">
        n.d.
    </div>
</div>
<br>
<div class="box10">
    <div class="lin-area-c1">
        9672
    </div>
    <div class="lin-area-c2">
        Administração e Marketing
    </div>
    <div class="lin-area-c3">
        [Lic-1º cic]
    </div>
</div>

Each box10 div and the lin-curso divs that follow it logically belong together, but they are not wrapped in a common element. Some box10 headers are followed by a single lin-curso row, while others are followed by several. If each box10 and its lin-curso rows shared a parent element there would be no problem. Is there a way to associate them?

EDIT: The page is http://www.dges.gov.pt/guias/indcurso.asp?letra=A

and the container element is ".inside".

The solution to this problem is fairly easy once you work with sibling elements. In your case, each div with class box10 plays the role of a table header, and the sibling divs with class lin-curso that follow it play the role of its data rows. I would suggest first selecting all divs with class box10:

Elements boxes = doc.select("div.box10");

Then you can iterate over boxes and do two things:

  1. Extract the data you are interested in from this div (it contains 3 child elements: divs with classes lin-area-c1, lin-area-c2 and lin-area-c3).
  2. Select the sibling elements with class lin-curso that follow it and extract data from them.

Jsoup provides a method called Element.nextElementSibling() that returns the next sibling element of the element you call it on, or null if there is none. So when you call it on a div.box10 element you get the div.lin-curso element that follows it.

Sibling in this case means the element immediately following the specified element at the same tree level.
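
To make the sibling behaviour concrete, here is a minimal sketch that runs against an inline HTML string instead of the live page. The class name SiblingDemo, the helper describeNext and the tiny HTML fragment are assumptions made up for illustration; only Jsoup.parse, selectFirst, select, last, nextElementSibling and className are real Jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SiblingDemo {

    // Hypothetical helper: returns the class attribute of the element that
    // immediately follows el, or "none" when el is the last element child.
    static String describeNext(Element el) {
        Element next = el.nextElementSibling();
        return next == null ? "none" : next.className();
    }

    public static void main(String[] args) {
        // Made-up fragment shaped like the page in the question.
        String html = "<div class=\"inside\">"
                + "<div class=\"box10\">header</div>"
                + "<div class=\"lin-curso\">row 1</div>"
                + "<div class=\"lin-curso\">row 2</div>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        // The element right after the header is the first data row.
        System.out.println(describeNext(doc.selectFirst("div.box10")));

        // The last data row has no following sibling, so the helper
        // falls back to "none" instead of dereferencing null.
        System.out.println(describeNext(doc.select("div.lin-curso").last()));
    }
}
```

The second call is the case to watch for: any loop that walks siblings should check for null before calling methods on the result.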

Example solution

Below is example code that parses the given page and prints the table to the console:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

final class TestMain {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.dges.gov.pt/guias/indcurso.asp?letra=A").get();

        Elements boxes = doc.select("div.box10");

        for (Element box : boxes) {
            String linAreaC1 = box.select(".lin-area-c1").text();
            String linAreaC2 = box.select(".lin-area-c2").text();
            String linAreaC3 = box.select(".lin-area-c3").text();

            System.out.printf("%s: %s %s%n", linAreaC1, linAreaC2, linAreaC3);

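            // Walk the siblings that follow this box10 until a non-lin-curso
            // element (such as <br>) or the end of the parent is reached.
            Element linCurso = box.nextElementSibling();

            while (linCurso != null && linCurso.hasClass("lin-curso")) {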
                String linCursoC2 = linCurso.select(".lin-curso-c2").text();
                String linCursoC3 = linCurso.select(".lin-curso-c3").text();
                String linCursoC4 = linCurso.select(".lin-curso-c4").text();

                System.out.printf("%s\t%s\t%s%n", linCursoC2, linCursoC3, linCursoC4);

                linCurso = linCurso.nextElementSibling();
            }

            System.out.println("==============================");
        }
    }
}
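
If you want the association as data rather than console output, the same sibling walk can fill a map. The following is only a sketch under assumptions: the class CourseGrouper, the method group and the inline HTML fragment are hypothetical, and the second row under 9059 (code 9999) is invented to show a header with more than one row.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CourseGrouper {

    // Maps the code of each box10 header (lin-area-c1) to the codes
    // (lin-curso-c2) of the lin-curso rows that directly follow it.
    static Map<String, List<String>> group(Document doc) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element box : doc.select("div.box10")) {
            List<String> rows = new ArrayList<>();
            Element next = box.nextElementSibling();
            // Stop at the first sibling that is not a lin-curso row
            // (for example <br>) or when there are no siblings left.
            while (next != null && next.hasClass("lin-curso")) {
                rows.add(next.select(".lin-curso-c2").text());
                next = next.nextElementSibling();
            }
            result.put(box.select(".lin-area-c1").text(), rows);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragment: one header with one row, one with two rows
        // (9999 is invented), and one with no rows at all.
        String html = "<div class=\"inside\">"
                + "<div class=\"box10\"><div class=\"lin-area-c1\">L160</div></div>"
                + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">3155</div></div>"
                + "<br>"
                + "<div class=\"box10\"><div class=\"lin-area-c1\">9059</div></div>"
                + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">2270</div></div>"
                + "<div class=\"lin-curso\"><div class=\"lin-curso-c2\">9999</div></div>"
                + "<br>"
                + "<div class=\"box10\"><div class=\"lin-area-c1\">9672</div></div>"
                + "</div>";
        System.out.println(group(Jsoup.parse(html)));
    }
}
```

A LinkedHashMap keeps the headers in page order, and a header with no following rows simply maps to an empty list.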

I hope it helps. 希望对您有所帮助。
