
Parsing HTML: Getting a particular HTML definition list after a particular paragraph using Jsoup

I am trying to get the content of a definition list (or any tag) that follows a particular tag satisfying my condition, using Jsoup with Java. As an example, let's assume we have an HTML document as follows.

<p>PageID: 2816; NS: 0; Title: some text; 
Image url: 
Content:
{{Wort der Woche}}
{{Siehe auch}}
</p>
<h2><span class="1" id="e1">some text</span></h2>
<h3><span class="1" id="e2">some text</span></h3>

<p>{{Transportation}}
</p>
<dl>
    <dd>Flying</dd>
    <dd>Driving</dd>
    <dd>Sailing
        <dl>
            <dd>Boat</dd>
            <dd>Ship</dd>
        </dl>
    </dd>
</dl>

<p>{{Activities}}
</p>
<dl>
    <dd>Shopping</dd>
    <dd>Painting</dd>
</dl>

Let's assume we want to get the content of the "dl" tag that occurs after the "Transportation" paragraph, namely:

<dl>
    <dd>Flying</dd>
    <dd>Driving</dd>
    <dd>Sailing
        <dl>
            <dd>Boat</dd>
            <dd>Ship</dd>
        </dl>
    </dd>
</dl>

My initial attempt was to get the index of the paragraph (e.g. 1st, 2nd, etc.) and then get the dl at the corresponding index, but this does not seem to work because dls can be nested.
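
To illustrate the mismatch, here is a minimal sketch. The class name IndexMismatchDemo, the helper listTextByIndex, and the trimmed-down HTML are made up for illustration. It assumes the lists are collected with doc.select("dl"), which also returns nested lists, so the n-th paragraph no longer lines up with the n-th list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class IndexMismatchDemo {
    // Trimmed-down version of the document above (hypothetical test input).
    static final String HTML =
          "<p>{{Transportation}}</p>"
        + "<dl><dd>Flying</dd><dd>Sailing<dl><dd>Boat</dd></dl></dd></dl>"
        + "<p>{{Activities}}</p>"
        + "<dl><dd>Shopping</dd></dl>";

    // Naive lookup: pair the n-th <p> with the n-th <dl> in document order.
    static String listTextByIndex(Document doc, int paragraphIndex) {
        Elements lists = doc.select("dl"); // also contains the nested <dl>
        return lists.get(paragraphIndex).text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // Fewer paragraphs than lists, because one <dl> is nested in a <dd>.
        System.out.println("p count:  " + doc.select("p").size());
        System.out.println("dl count: " + doc.select("dl").size());
        // Index 1 is the {{Activities}} paragraph, but the list at index 1
        // is the nested one under "Sailing", not the Shopping list.
        System.out.println(listTextByIndex(doc, 1));
    }
}
```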

Does anyone have a suggestion about how to get such content?

Assuming HTML structured like your example, where the <dl> always follows the <p>, you could:

  • Get the desired <p> element through doc.getElementsContainingOwnText("txt");
  • Get the following <dl> using element.nextElementSibling().

Here's some sample code working on your HTML:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<p>PageID: 2816; NS: 0; Title: some text; \r\nImage url: \r\nContent:\r\n{{Wort der Woche}}\r\n{{Siehe auch}}\r\n</p>\r\n<h2><span class=\"1\" id=\"e1\">some text</span></h2>\r\n<h3><span class=\"1\" id=\"e2\">some text</span></h3>\r\n\r\n<p>{{Transportation}}\r\n</p>\r\n<dl>\r\n    <dd>Flying</dd>\r\n    <dd>Driving</dd>\r\n    <dd>Sailing\r\n        <dl>\r\n            <dd>Boat</dd>\r\n            <dd>Ship</dd>\r\n        </dl>\r\n    </dd>\r\n</dl>\r\n\r\n<p>{{Activities}}\r\n</p>\r\n<dl>\r\n    <dd>Shopping</dd>\r\n    <dd>Painting</dd>\r\n</dl>");
        Elements e = doc.getElementsContainingOwnText("{{Transportation}}"); // the <p> holding the marker text
        Element nextDL = e.get(0).nextElementSibling(); // the element right after that <p>
        System.out.println(nextDL);
    }
}

Output:

<dl> 
    <dd>Flying</dd> 
    <dd>Driving</dd> 
    <dd>Sailing 
        <dl> 
            <dd>Boat</dd> 
            <dd>Ship</dd> 
        </dl> 
    </dd> 
</dl>
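
If the <dl> does not always come directly after the <p>, or the marker may be missing, a slightly more defensive variant could look like the sketch below. The class ListAfterMarker, the method findListAfter, and the sample HTML with an extra <h4> are hypothetical. Stopping the search at the next <p> is an assumption about how the page is laid out, not something the original document guarantees.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListAfterMarker {
    // Returns the first <dl> sibling after a <p> whose own text contains the
    // marker (case-sensitive), or null if no such paragraph or list exists.
    static Element findListAfter(Document doc, String marker) {
        for (Element p : doc.select("p")) {
            if (!p.ownText().contains(marker)) {
                continue;
            }
            // Walk forward over the following siblings of the matching <p>.
            for (Element sib = p.nextElementSibling(); sib != null;
                 sib = sib.nextElementSibling()) {
                if (sib.tagName().equals("dl")) {
                    return sib;
                }
                if (sib.tagName().equals("p")) {
                    break; // assumed section boundary: this <p> has no list
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical input: an <h4> sits between the paragraph and its list.
        Document doc = Jsoup.parse(
              "<p>{{Transportation}}</p><h4>Modes</h4>"
            + "<dl><dd>Flying</dd><dd>Driving</dd></dl>"
            + "<p>{{Activities}}</p><dl><dd>Shopping</dd></dl>");
        System.out.println(findListAfter(doc, "{{Transportation}}"));
        System.out.println(findListAfter(doc, "{{Missing}}")); // no match, so null
    }
}
```

Because only siblings are walked, a nested <dl> inside a <dd> is never returned on its own; it comes along as part of its outer list.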
