How to parse elements with Jsoup within a given select range?

Hi, I am trying to parse a large amount of HTML with Jsoup, but cannot get the result I want. The HTML is generated Javadoc, which has few ids or other attributes that would help with parsing. Another problem is that the same tags repeat throughout the document, so I cannot write a strictly defined selector. I have mostly managed it, but I still have a problem with the method name, which sits in a different sibling of the DOM. Here is my HTML and the desired parsing result: http://img62.imageshack.us/img62/9870/08bz.png

I have to 'tie' the 'pre' and 'ol' tags together in some way (parse the desired HTML range). Please help me.

I tried to do it this way: Elements methodName = doc.select("pre:contains(public), dl > dd > ol"); but this returns too many method names.

If I understand correctly, you want only the public void method_name() and the list items that explain what the method does, but without any additional html tags.

Elements methodName = doc.select("pre:contains(public), dl > dd > ol > li");

This will select 4 elements in total: the method name and the three list items. They will still be wrapped in HTML tags such as <pre> and <li>. Call the text() method on each Element to get the content without those tags:

for (Element e : methodName) {  
    System.out.println(e.text());
}

Which outputs:

11-08 10:47:19.468: I/System.out(816): public void test()
11-08 10:47:19.468: I/System.out(816): Navigates to app
11-08 10:47:19.468: I/System.out(816): opens main panel
11-08 10:47:19.478: I/System.out(816): starts it

Due to the lack of any id attributes, I don't think it is possible to select only the relevant tags with one select statement. So instead you could iterate through the Elements that you do select and check whether a <pre> tag is followed by a <li> tag (assuming you use the same doc.select() statement I used in my first answer).

Example:

Elements methodName = doc.select("pre:contains(public), dl > dd > ol > li");

for (int i = 0; i < methodName.size(); i++) {
    Element current = methodName.get(i);
    if (current.tagName().equals("pre")) {
        // print a <pre> only if it is followed by a <li>; the bounds check avoids reading past the last element
        if (i + 1 < methodName.size() && methodName.get(i + 1).tagName().equals("li")) {
            System.out.println(current.text());
        }
    } else {
        System.out.println(current.text()); // otherwise it is a <li> tag, so print it
    }
}

This will provide the same output as my first example, even if there are two other <pre> tags with method names that don't have an <ol> list following them (as you mentioned in your comment).

Note: The i + 1 < methodName.size() check guards against an IndexOutOfBoundsException when a <pre> tag happens to be the last selected element.
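The same idea (keep a <pre> only when a <li> follows it) can also be used to group each method with its steps instead of printing one flat list. What follows is only a minimal sketch of that logic, written against the standard library so it runs without Jsoup. Node is a hypothetical stand-in for a Jsoup Element (in real code the two fields would come from tagName() and text()), and the setUp/tearDown signatures are made-up sample data, not taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MethodSteps {

    /** Hypothetical stand-in for a Jsoup Element: only a tag name and its text. */
    static final class Node {
        final String tag;
        final String text;

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    /**
     * Groups each <pre> signature with the <li> items that directly follow it.
     * A <pre> that is not immediately followed by a <li> is dropped.
     * The input is assumed to be in document order, as in the loop above.
     */
    static Map<String, List<String>> group(List<Node> nodes) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null; // steps of the most recently kept <pre>

        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (node.tag.equals("pre")) {
                // Bounds check first, so i + 1 never runs past the end of the list.
                boolean hasSteps = i + 1 < nodes.size()
                        && nodes.get(i + 1).tag.equals("li");
                current = hasSteps ? new ArrayList<String>() : null;
                if (hasSteps) {
                    result.put(node.text, current);
                }
            } else if (node.tag.equals("li") && current != null) {
                current.add(node.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Node> selected = Arrays.asList(
                new Node("pre", "public void setUp()"),     // no list follows: dropped
                new Node("pre", "public void test()"),
                new Node("li", "Navigates to app"),
                new Node("li", "opens main panel"),
                new Node("li", "starts it"),
                new Node("pre", "public void tearDown()")); // last element: dropped

        // prints {public void test()=[Navigates to app, opens main panel, starts it]}
        System.out.println(group(selected));
    }
}
```

One limitation of this sketch: two methods with identical signature text would share a map key, so the later one would replace the earlier one.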
