How to read an HTML table with Jsoup

I am trying to read the table of cities from here.

Essentially, I want all the city names, but I am stuck at the part where I traverse into the table.

The select code:

 Element table = rawCities.getElementById("content")
                 .getElementById("bodyContent")
                 .getElementById("mw-content-text")
                 .select("table.wikitable sortable jquery-tablesorter").first()
                 .select("tbody").first();

The document is downloaded and parsed with Jsoup.connect in another class, and here I am trying to get the city names. When I traverse with the selects I get a NullPointerException. If I remove the .select("tbody").first() call, the program runs, but the debugger shows that the table variable is null. Should I be doing this another way, or did I get something wrong?

If you print rawCities you will most probably not find any element representing a <jquery-tablesorter> tag, so you should remove it from your select.

Another problem is that table.wikitable sortable will try to find

<table class="wikitable">
  ...
    <sortable>
  ...
</table>

not

<table class="wikitable sortable">...

To find an element with several classes, put the . operator before each class name, as in element.class1.class2, rather than a space, as in element.class1 class2 (a space describes an ancestor-descendant relationship).
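The difference between the two selectors can be checked on a small inline document. This is only a sketch: the HTML string, the class name SelectorDemo and the helper countMatches are made up for illustration, and it assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical sample markup standing in for the downloaded page.
    static final String HTML =
        "<table class=\"wikitable sortable\"><tbody>"
      + "<tr><td><a href=\"#\">Alpha</a></td><td>1</td></tr>"
      + "<tr><td><a href=\"#\">Beta</a></td><td>2</td></tr>"
      + "</tbody></table>";

    // Number of elements matched by a CSS query against the sample markup.
    static int countMatches(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Space: looks for a <sortable> element inside table.wikitable.
        System.out.println(countMatches("table.wikitable sortable"));
        // Dots only: looks for one <table> carrying both classes.
        System.out.println(countMatches("table.wikitable.sortable"));
    }
}
```

On this sample the first query matches nothing, because there is no <sortable> element, while the second matches the single table.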

So your code could be simplified to

Element table = rawCities
        .select("table.wikitable.sortable tbody")
        .first();
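Note that first() returns null when the selection is empty, which is why chaining another call onto a failed select throws a NullPointerException. A minimal sketch of guarding against that, assuming jsoup is on the classpath; the class NullSafeSelect, the helper rowCount and the sample HTML are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NullSafeSelect {
    // Hypothetical sample markup standing in for the downloaded page.
    static final String HTML =
        "<table class=\"wikitable sortable\"><tbody>"
      + "<tr><td><a href=\"#\">Alpha</a></td></tr>"
      + "<tr><td><a href=\"#\">Beta</a></td></tr>"
      + "</tbody></table>";

    // Returns the number of <tr> rows under the first match,
    // or -1 when the query matches nothing.
    static int rowCount(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element tbody = doc.select(cssQuery).first(); // null if no match
        if (tbody == null) {
            return -1;
        }
        return tbody.select("tr").size();
    }

    public static void main(String[] args) {
        System.out.println(rowCount(HTML, "table.wikitable.sortable tbody"));
        System.out.println(rowCount(HTML, "table.wikitable sortable jquery-tablesorter"));
    }
}
```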

Anyway, if you only want to print the content of the first column of the selected table, you can do it with

for (Element row : rawCities.select("table.wikitable.sortable td:eq(0) a")) {
    System.out.println(row.text());
}

You can also use this loop to add the results of row.text() to a List<String> created earlier, or use code like

// Needs: import java.util.List; import java.util.stream.Collectors;
List<String> names = rawCities
        .select("table.wikitable.sortable td:eq(0) a")
        .stream()
        .map(e -> e.text())
        .collect(Collectors.toList());
