Hi, I am trying to parse a large amount of HTML with Jsoup, but cannot get the result I want. The HTML comes from generated Javadoc, which has few ids or other elements that would help with parsing. Another problem is that the same tags repeat throughout the document, so I cannot use a strictly defined select. I got most of it working, but still have a problem with the method name, which sits in a different sibling in the DOM. Here is my HTML and the desired parsing result: http://img62.imageshack.us/img62/9870/08bz.png
I have to somehow tie the 'pre' and 'ol' tags together (parse the desired HTML range). Please help.
I tried it this way: Elements methodName = doc.select("pre:contains(public), dl > dd > ol");
but this returns too many method names.
If I understand correctly, you want only the method signature, public void method_name(), and the list items that explain what the method does, without any additional HTML tags.
Elements methodName = doc.select("pre:contains(public), dl > dd > ol > li");
This selects 4 elements in total: the method name and the three list items. They are still wrapped in HTML tags such as <pre> and <li>. Call the text() method on each Element to get the content without those tags:
for (Element e : methodName) {
    System.out.println(e.text());
}
Which outputs:
11-08 10:47:19.468: I/System.out(816): public void test()
11-08 10:47:19.468: I/System.out(816): Navigates to app
11-08 10:47:19.468: I/System.out(816): opens main panel
11-08 10:47:19.478: I/System.out(816): starts it
Because the document has no id attributes, I don't think it is possible to select only the relevant tags with a single select statement. Instead, you could iterate through the Elements you do select and check whether each <pre> tag is followed by a <li> tag (assuming you use the same doc.select() statement I used in my first answer).
Example:
Elements methodName = doc.select("pre:contains(public), dl > dd > ol > li");
for (int i = 0; i < methodName.size(); i++) {
    if (methodName.get(i).tagName().equals("pre")) { // if the <pre> tag
        if (methodName.get(i + 1).tagName().equals("li")) { // is followed by a <li> tag
            System.out.println(methodName.get(i).text()); // print it
        }
    } else {
        System.out.println(methodName.get(i).text()); // else it is a <li> tag so print it
    }
}
This gives the same output as my first example, even if there are two other <pre> tags with method names that are not followed by an <ol> list (as you mentioned in your comment).
Note: methodName.get(i + 1) throws an IndexOutOfBoundsException if a <pre> tag happens to be the last selected element, so depending on how your document is formatted you may need to guard that access with a check such as i + 1 < methodName.size().
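A minimal sketch of the filtering loop with that bounds check added. To stay runnable without Jsoup on the classpath, it works on a hypothetical Node class (tag name plus text) that stands in for a Jsoup Element; with Jsoup you would fill it from e.tagName() and e.text(). The class name MethodStepFilter and the setUp() and tearDown() signatures in the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MethodStepFilter {

    /** Hypothetical stand-in for a Jsoup Element: only its tag name and text. */
    static final class Node {
        final String tag;
        final String text;

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    /**
     * Keeps every "li" node, and keeps a "pre" node only when the node
     * right after it is an "li". A trailing "pre" is dropped instead of
     * causing an IndexOutOfBoundsException.
     */
    static List<String> filter(List<Node> nodes) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            Node current = nodes.get(i);
            if (current.tag.equals("pre")) {
                // Only look at index i + 1 if it exists.
                boolean hasNext = i + 1 < nodes.size();
                if (hasNext && nodes.get(i + 1).tag.equals("li")) {
                    kept.add(current.text);
                }
            } else {
                kept.add(current.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample data in document order; setUp() and tearDown() are invented
        // examples of <pre> tags that have no list after them.
        List<Node> nodes = Arrays.asList(
            new Node("pre", "public void setUp()"),
            new Node("pre", "public void test()"),
            new Node("li", "Navigates to app"),
            new Node("li", "opens main panel"),
            new Node("li", "starts it"),
            new Node("pre", "public void tearDown()"));

        for (String line : filter(nodes)) {
            System.out.println(line);
        }
    }
}
```

With this sample data, the signature for test() and its three list items are kept, while setUp() and the trailing tearDown() are skipped.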