Web scraping with jsoup is returning only part of the table

I am new to coding.
I'm trying to scrape a table with a list of funds from a broker's website. The code runs without errors, but it returns only part of the list (a bit more than the first half), and I can't figure out why.

I've already checked the HTML structure and the tags, and everything seems to be right.

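import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

int count = 0;
String URL = "https://institucional.xpi.com.br/investimentos/fundos-de-investimento/lista-de-fundos-de-investimento.aspx";

try {
    Document doc = Jsoup.connect(URL).userAgent("Mozilla/17.0").get();

    // Walk every row of the table and print the third cell of each data row
    for (Element row : doc.select("#tableTodos tr")) {
        Elements tds = row.getElementsByTag("td");
        if (tds.size() > 0) {
            count++;
            System.out.println(count + " - " + tds.get(2).text());
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}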

This is the last part of the output:

138 - Kapitalo Kappa FIN FIC FIM
139 - Kapitalo Tarkus FIC FIA
140 - Kinea Atlas II FIM
141 - Kinea Chronos FIM
142 - Kinea RF Absoluto FI LP
143 - Leblon Ações FIC FIA
144 - Legacy Capital Advisory FIC FIM
145 - Legg Mason Clearbridge US Large Cap Growth FIA IE
146 - Legg Mason Martin Currie European Absolute Alpha FIM IE
147 - Mauá Capital Ações FIC FIA

It only goes up to 147, but the table on the website has more than 300 rows.

You should select the elements in the table by the "tr" tag rather than "td". That gives you all the rows in the table. Then, in each row, look up the td elements and print their text.

EDIT 1:

    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.addArguments("--headless");
    chromeOptions.addArguments("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.2840.100 Safari/537.36");
    ChromeDriver chromeDriver = new ChromeDriver(chromeOptions);
    chromeDriver.get("https://institucional.xpi.com.br/investimentos/fundos-de-investimento/lista-de-fundos-de-investimento.aspx");
    // Collect every row of the table, then print one that lies beyond row 147
    List<WebElement> elements = chromeDriver.findElement(By.xpath("//*[@id=\"tableTodos\"]")).findElements(By.tagName("tr"));
    System.out.println(elements.get(200).getText());
    chromeDriver.quit();

EDIT 2:

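Add maxBodySize to your get call. By default jsoup caps the size of the response body (1 MB in older releases) and silently truncates anything beyond that, which is why the table stops partway through. Passing 0 to maxBodySize removes the limit, and timeout(0) disables the timeout: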

Document doc = Jsoup.connect(URL).timeout(0).maxBodySize(0).get();
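
To illustrate why a body-size cap produces exactly this symptom, here is a minimal sketch that uses only the Java standard library. It is a simulation, not jsoup's actual implementation: the class TruncationDemo, its helper methods, and the synthetic "Fund N" rows are all made up for the example. It builds a 300-row table, cuts the body off at a byte limit, and counts how many complete rows survive.

```java
import java.nio.charset.StandardCharsets;

public class TruncationDemo {

    // Builds a synthetic HTML table with the given number of rows (made-up data).
    static String buildTable(int rows) {
        StringBuilder sb = new StringBuilder("<table id=\"tableTodos\">");
        for (int i = 1; i <= rows; i++) {
            sb.append("<tr><td>").append(i).append("</td><td>Fund ").append(i).append("</td></tr>");
        }
        return sb.append("</table>").toString();
    }

    // Keeps only the first maxBytes bytes of the body; 0 means "no limit",
    // mirroring the convention of maxBodySize(0).
    static String truncate(String body, int maxBytes) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        if (maxBytes == 0 || bytes.length <= maxBytes) {
            return body;
        }
        return new String(bytes, 0, maxBytes, StandardCharsets.UTF_8);
    }

    // Counts the rows whose closing </tr> tag made it into the received text.
    static int countCompleteRows(String html) {
        int count = 0;
        int from = 0;
        while ((from = html.indexOf("</tr>", from)) != -1) {
            count++;
            from += "</tr>".length();
        }
        return count;
    }

    public static void main(String[] args) {
        String body = buildTable(300);
        int half = body.getBytes(StandardCharsets.UTF_8).length / 2;
        // With a cap, only the rows that fit in the first half are seen
        System.out.println("capped:    " + countCompleteRows(truncate(body, half)));
        // With no cap, all rows are seen
        System.out.println("unlimited: " + countCompleteRows(truncate(body, 0)));
    }
}
```

The capped run reports fewer rows than the page contains and raises no error, which matches the question: the parser simply never receives the rest of the table.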
