
Jsoup: Trouble extracting formatting from HTML tables

    <tr>
    <th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order Theorems</span>
    </th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>
    </th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>
    </th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>
    </th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>
    </th></tr>

Let's say I want to extract bgcolor, align, and the text contained in the span: for example GREY, LEFT, Higher-order Theorems.

If I wanted to extract at the very least bgcolor, but ideally all three, how would I do so?

I was attempting to extract just the bgcolor. I've tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor), as well as doc.select([style]), and all of them either returned no output or raised a parse error. I can extract the contents of the span just fine; the problem is extracting bgcolor and align as well.

You just need to parse the HTML you want to scrape with jsoup, select the th elements, and read the attributes you want with the attr method of each jsoup Element; that gives you the value of that attribute for every th tag in the HTML. To also retrieve the text contained between the span tags, select the nested span inside the th and call .text() on it.

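    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class ThAttributes {
        public static void main(String[] args) {
            // Replace with the HTML to scrape. Keep the enclosing <table>:
            // the HTML parser drops tr/th tags that appear outside a table.
            String html = "YOUR HTML GOES HERE";
            Document document = Jsoup.parse(html);
            System.out.println(document); // debug: show the parsed tree
            Elements elements = document.select("tr > th");

            for (Element element : elements) {
                // attr() returns an empty string when the attribute is absent
                String align = element.attr("align");
                String color = element.attr("bgcolor");
                String spanText = element.select("span").text();

                System.out.println("Align is " + align +
                        "\nBackground Color is " + color +
                        "\nSpan Text is " + spanText);
            }
        }
    }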

For any further information, feel free to ask me! Hope this helped you!

Updated answer in response to the comment:

To do that, you'll need to use this line inside the for-each loop:

    String fullText = element.text();

That way you get all the text contained between the selected Element's tags, but you should look up this blog and fit your desired query to it. I guess you will also need to check whether the String is empty, and run a separate query for each possible case, using if conditionals.

That implies having one query for this structure: tr > th > span, another for this one: tr > th > em, and another for: tr > th.
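The three cases can also be handled inside a single loop over the th elements instead of three separate queries. The following is only a minimal sketch of that idea, not the answerer's code: the class name HeaderCellExtractor, the helper describeCells, and the pipe-separated output format are made up for illustration. The sample row is wrapped in a table element on the assumption that the input might be a bare row, since jsoup's HTML parser discards tr and th tags that appear outside a table.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class HeaderCellExtractor {

    // Hypothetical helper: returns one "bgcolor|align|label|emphasized"
    // string per <th> cell found in the given HTML.
    static List<String> describeCells(String html) {
        Document document = Jsoup.parse(html);
        List<String> cells = new ArrayList<>();

        for (Element th : document.select("tr > th")) {
            String color = th.attr("bgcolor"); // "" when the attribute is absent
            String align = th.attr("align");   // "" when the attribute is absent

            // first() returns null when nothing matched
            Element link = th.select("a").first();
            Element span = th.select("span").first();
            boolean emphasized = !th.select("em").isEmpty();

            String label;
            if (link != null) {
                // th > a (optionally inside em): link text plus the span text, if any
                label = link.text() + (span != null ? " " + span.text() : "");
            } else if (span != null) {
                // th > span: the span holds the whole label
                label = span.text();
            } else {
                // plain th: fall back to all text in the cell
                label = th.text();
            }

            cells.add(color + "|" + align + "|" + label + "|" + emphasized);
        }
        return cells;
    }

    public static void main(String[] args) {
        // Shortened sample modeled on the row in the question; "#x" is a placeholder href.
        String html = "<table><tr>"
                + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
                + "<th bgcolor=\"PINK\"><em><a href=\"#x\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
                + "</tr></table>";

        for (String cell : describeCells(html)) {
            System.out.println(cell);
        }
    }
}
```

Checking for null (or an empty selection) per cell keeps the empty-String checks mentioned above in one place, and the emphasized flag records whether the cell used the em variant.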
