Jsoup cannot extract formatting from an HTML table

Question

<tr>

<th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order Theorems</span>

</th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>

</th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>

</th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>

</th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>

</th></tr>

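So, let's say I want to extract bgcolor, align, and the content inside the span class. For example: GREY, LEFT, Higher-order Theorems.

If I want to extract at least bgcolor, but ideally all three, how would I do that?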

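So I started by trying to extract just bgcolor.

I have tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor]), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor)), and doc.select([style]). All of them either returned no output or produced a parse error. I can extract the content of the span class just fine; the problem is extracting bgcolor and align.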

Answer 1

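You just need to parse the HTML you want to scrape with Jsoup, select the HTML tags you are interested in, and then call attr on each Jsoup Element, which gives you the value of that attribute for each tag. To also retrieve the text contained between the span tags, select the span nested inside the th and call .text() on it.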

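    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // html is a String holding the HTML of the page
    Document document = Jsoup.parse(html);
    System.out.println(document); // print the parsed document for inspection
    Elements elements = document.select("tr > th");

    for (Element element : elements) {
        // attr() returns an empty string when the attribute is absent
        String align = element.attr("align");
        String color = element.attr("bgcolor");
        // text of the <span> nested inside this <th>
        String spanText = element.select("span").text();

        System.out.println("Align is " + align +
                "\nBackground Color is " + color +
                "\nSpan Text is " + spanText);
    }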
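
The attribute selector attempted in the question also works once it is passed as a quoted CSS query, for example "th[bgcolor]", which matches only the th cells that carry a bgcolor attribute. Below is a minimal, self-contained sketch of that variant. It assumes the row sits inside a table element (as far as I know, the HTML parser discards tr and th tags that appear outside a table), and the class name BgcolorDemo, the method extractColors, and the shortened sample HTML are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BgcolorDemo {
    // Shortened, hypothetical sample; wrapped in <table> so the row and cells are kept.
    static final String HTML =
            "<table><tr>"
            + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
            + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
            + "</tr></table>";

    // Returns the bgcolor value of every <th> that has the attribute.
    static List<String> extractColors(String html) {
        Document doc = Jsoup.parse(html);
        List<String> colors = new ArrayList<>();
        for (Element th : doc.select("th[bgcolor]")) {
            colors.add(th.attr("bgcolor"));
        }
        return colors;
    }

    public static void main(String[] args) {
        System.out.println(extractColors(HTML));
    }
}
```

The same loop can read align with th.attr("align"); cells without that attribute yield an empty string rather than null.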

Feel free to ask me if you need more information! I hope this helps!

Answer updated in response to a comment:

To do that, use this line inside the for-each loop:

    String fullText = element.text();

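This gives you all the text contained between the tags of the selected Element, but you should look at this blog and adapt it to your query. I guess you will also need to check whether the String is empty, and use if conditions to run a separate query for each possible case.

That means one query for the structure tr > th > span, another for tr > th > em, and another for plain tr > th.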
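
One possible reading of that suggestion is sketched below: look for a nested em first, then a nested span, and fall back to the text of the whole th when neither yields anything. This is only an illustration under stated assumptions; the class name HeaderCellDemo, the method cellText, the order of the checks, and the sample HTML (including the plain third cell) are hypothetical and not taken from the original page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeaderCellDemo {
    // Hypothetical sample with one cell per structure; wrapped in <table> so the parser keeps it.
    static final String HTML =
            "<table><tr>"
            + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
            + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
            + "<th bgcolor=\"LIME\">Plain</th>"
            + "</tr></table>";

    // Picks the text for one header cell depending on which nested tag is present.
    static String cellText(Element th) {
        Element em = th.select("em").first();     // first <em> inside the cell, or null
        Element span = th.select("span").first(); // first <span> inside the cell, or null
        String text = "";
        if (em != null) {
            text = em.text();
        } else if (span != null) {
            text = span.text();
        }
        // Fall back to the whole cell when the nested query produced nothing.
        return text.isEmpty() ? th.text() : text;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (Element th : doc.select("tr > th")) {
            System.out.println(th.attr("bgcolor") + ": " + cellText(th));
        }
    }
}
```

Checking em before span is a design choice: in the sample row the em wraps both the link and the version span, so testing span first would return only the version number for that cell.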

Solution 1
0 Accepted 2018-11-26 11:04:14
