
How to get multiple tables using JSoup from a website

I need to get all 9 tables from:

https://www.basketball-reference.com/players/c/collijo01.html

My current code only handles one table. I tried switching .first() to .last(), which doesn't work. I also tried using ("table.totals") to grab a table by name, but that failed too.

public static void getData(String url) throws IOException
{
    String fileName = "table.csv";
    FileWriter writer = new FileWriter(fileName);
    Document doc = Jsoup.connect(url).get();
    Element tableElement = doc.select("table").first();

    System.out.println(doc);

    Elements tableHeaderEles = tableElement.select("thead tr th");
    for (int i = 0; i < tableHeaderEles.size(); i++) {
        writer.append(tableHeaderEles.get(i).text());

        if(i != tableHeaderEles.size() -1){             
            writer.append(',');
        }
    }
    writer.append('\n');
    System.out.println();

    Elements tableRowElements = tableElement.select(":not(thead) tr");

    for (int i = 0; i < tableRowElements.size(); i++) {
        Element row = tableRowElements.get(i);
        Elements rowItems = row.select("td");
        for (int j = 0; j < rowItems.size(); j++) {
            writer.append(rowItems.get(j).text());

            if(j != rowItems.size() -1){
                writer.append(',');
            }
        }
        writer.append('\n');
    }

    writer.close();
}

I get the first table from the site perfectly, but I'm unable to get past that. Does anyone know how to get all the tables, or how to grab tables based on ID?

EDIT: in case anyone wants to test this code's output for themselves:

 public static void read(String file) throws IOException
 {
    Scanner scanner = new Scanner(new File(file));
    scanner.useDelimiter(",");
    while(scanner.hasNext()){
        System.out.print(scanner.next()+"|");
    }
    scanner.close();
}

You've already selected all tables, but you're explicitly getting only the first one:

Element tableElement = doc.select("table").first();

Instead, you can easily iterate over all of them:

Elements tableElements = doc.select("table");
for (Element tableElement : tableElements) {
   // for each of selected tables
}

So, after some modifications to give each file a unique name, the code will look like this:


import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void getData(String url) throws IOException {
    String html = Jsoup.connect(url).execute().body();
    // This page is tricky: it contains tables as commented-out HTML and shows them using JavaScript,
    // so I'm using a dirty replace to remove the comment markers before parsing, making the tables visible to Jsoup.
    html = html.replace("<!--", "");
    html = html.replace("-->", "");
    Document doc = Jsoup.parse(html);
    Elements tableElements = doc.select("table");
    int number = 1;
    for (Element tableElement : tableElements) {
        String tableId = tableElement.id();
        if (tableId.isEmpty()) {
            // Skip tables without an id.
            continue;
        }
        // The counter plus the table id gives every table its own file name.
        String fileName = "table" + number++ + " with id " + tableId + ".csv";

        // try-with-resources closes the writer even if writing fails.
        try (FileWriter writer = new FileWriter(fileName)) {
            // Header row: one column per th cell.
            Elements tableHeaderEles = tableElement.select("thead tr th");
            for (int i = 0; i < tableHeaderEles.size(); i++) {
                writer.append(tableHeaderEles.get(i).text());
                if (i != tableHeaderEles.size() - 1) {
                    writer.append(',');
                }
            }
            writer.append('\n');

            Elements tableRowElements = tableElement.select(":not(thead) tr");
            for (Element row : tableRowElements) {
                Elements rowItems = row.select("td");
                if (rowItems.isEmpty()) {
                    // The selector above also matches header rows, which hold only th cells; skip them.
                    continue;
                }
                for (int j = 0; j < rowItems.size(); j++) {
                    writer.append(rowItems.get(j).text());
                    if (j != rowItems.size() - 1) {
                        writer.append(',');
                    }
                }
                writer.append('\n');
            }
        }
    }
}
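
The blanket replace above strips every comment marker on the page, including those around ordinary comments. A slightly more selective variant is sketched below, assuming the hidden tables always sit inside a single `<!-- ... -->` block that contains `<table`. The class name `CommentUnwrapper` and the method `unwrapTableComments` are made up for illustration; they are not part of Jsoup. It uses only the standard library, so the result can be passed to `Jsoup.parse(...)` exactly as before.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentUnwrapper {
    // Matches one HTML comment; DOTALL lets the body span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    /**
     * Returns the HTML with the comment markers removed only around
     * comments whose body contains a table. Other comments are kept as-is.
     */
    public static String unwrapTableComments(String html) {
        Matcher m = COMMENT.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous comment and this one.
            out.append(html, last, m.start());
            String inner = m.group(1);
            // Unwrap table-bearing comments, leave the rest untouched.
            out.append(inner.contains("<table") ? inner : m.group());
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input that mimics a table hidden inside a comment.
        String html = "<p>stats</p><!-- a note -->"
                + "<!--\n<table id=\"advanced\"><tr><td>1</td></tr></table>\n-->";
        System.out.println(unwrapTableComments(html));
    }
}
```

In `getData`, the two `replace` calls could then become `html = CommentUnwrapper.unwrapTableComments(html);`. Whether that is worth the extra code depends on whether the page's other comments ever matter to the parse.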

Answering your second question:

grab tables based on ID

Instead of selecting the first of all tables:

Element tableElement = doc.select("table").first();

select the first table with id advanced:

Element tableElement = doc.select("table#advanced").first();

Additional advice: the strings you pass as parameters to select(...) are CSS selectors.
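
This probably also explains why ("table.totals") found nothing: in CSS, `.totals` matches a class named totals, while `#totals` matches an id. The sketch below tries a few selector forms against a small inline document. The markup, the class name `stats_table`, and the helper names `count` and `firstText` are invented for the example and are not taken from the real page. It needs Jsoup on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    // Hypothetical markup: two tables with an id and a class, one with neither.
    static final String HTML =
          "<table id=\"per_game\" class=\"stats_table\"><tr><td>a</td></tr></table>"
        + "<table id=\"advanced\" class=\"stats_table\"><tr><td>b</td></tr></table>"
        + "<table><tr><td>c</td></tr></table>";

    /** Number of elements in the sample document that match the CSS selector. */
    static int count(String css) {
        Document doc = Jsoup.parse(HTML);
        Elements found = doc.select(css);
        return found.size();
    }

    /** Text of the first match, or null when nothing matches. */
    static String firstText(String css) {
        Element first = Jsoup.parse(HTML).select(css).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        System.out.println(count("table"));               // every table
        System.out.println(count("table.stats_table"));   // tables with that class
        System.out.println(count("table[id]"));           // tables that have any id
        System.out.println(firstText("table#advanced"));  // the table with id "advanced"
        System.out.println(firstText("table.advanced"));  // class selector: no match here
    }
}
```

Note that `first()` returns null when the selector matches nothing, so check the result before calling methods on it.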
