I am using jsoup to scrape various HTML pages:
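import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlParse {
    public static void main(String[] args) throws IOException {
        // URL of the page to scrape, passed as the first command-line argument
        String site = args[0];
        Document doc = Jsoup.connect(site).get();
        // text() returns the visible text of the body with all tags stripped
        String htm = doc.body().text();
        System.out.println(htm);
    }
}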
It works beautifully. However, the output contains a lot of fluff (e.g., the text of website links, [a href]). Is there a quick way to omit this in jsoup? I found the getElementsByTag documentation but am having a hard time using it.
Thank you in advance.
You can "clean" the parsed Document. For example, to keep only simple text (plain text plus basic formatting tags such as b, em, i, strong, and u):
Whitelist whitelist = Whitelist.simpleText();
String result = Jsoup.clean(doc.html(), whitelist);
Or you can simply delete all a tags:
doc.select("a").remove();
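To make the difference between the two approaches concrete, here is a minimal sketch that runs them on an inline HTML string instead of a live site. The class name LinkStripDemo, both helper method names, and the sample HTML are hypothetical and chosen only for illustration. It also assumes jsoup 1.14.2 or later, where Whitelist was renamed to Safelist; on older versions, substitute Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class LinkStripDemo {

    // Hypothetical helper: removes every <a> element, including its link text,
    // and returns the remaining visible text of the body.
    static String textWithoutLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").remove();
        return doc.body().text();
    }

    // Hypothetical helper: strips all tags but keeps their text content,
    // so the link text survives while the <a> markup is dropped.
    // Assumes jsoup >= 1.14.2 (Safelist); older versions use Whitelist.
    static String textKeepingLinkText(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch
        String html = "<p>Read the <a href=\"https://example.com\">docs</a> first.</p>";
        System.out.println(textWithoutLinks(html));
        System.out.println(textKeepingLinkText(html));
    }
}
```

Note that doc.body().text() already drops the tags themselves, so if the unwanted "fluff" is the link text, removing the a elements before calling text() is the approach that actually gets rid of it. Cleaning with a Safelist only removes markup and leaves the link text in place.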