
Omit links, ads, etc. from jsoup parse

I am using jsoup to scrape different html pages:

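import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlParse {
    public static void main(String[] args) throws IOException {
        // URL of the page to scrape, passed as the first argument
        String site = args[0];
        Document doc = Jsoup.connect(site).get();
        // text() returns the combined text of the body and all its children
        String htm = doc.body().text();
        System.out.println(htm);
    }
}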

It works beautifully. However, the output contains a lot of fluff (e.g., the text of website links [a href]). Is there a quick way to omit this in jsoup? I found the getElementsByTag documentation but am having a hard time using it.

Thank you in advance.

You can "clean" the parsed Document. For example, to keep only simple text:

Whitelist whitelist = Whitelist.simpleText();
String result = Jsoup.clean(doc.html(), whitelist);
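
As a rough sketch of how cleaning behaves, the example below assumes a recent jsoup release, where the class is named Safelist (older releases call it Whitelist, as in the snippet above). The class name CleanSketch, the method toPlainText, and the sample HTML are made up for illustration. Note that cleaning strips disallowed tags but keeps the text inside them, so the wording of a link survives even though the a tag itself is gone.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {
    // Hypothetical helper: strips every tag, keeping only the text nodes.
    public static String toPlainText(String bodyHtml) {
        // Safelist.none() allows no tags at all; Safelist.simpleText()
        // would additionally keep b, em, i, strong and u.
        return Jsoup.clean(bodyHtml, Safelist.none());
    }

    public static void main(String[] args) {
        // Sample input, invented for this sketch
        String html = "<p>Hello <a href=\"/more\">link</a> <b>bold</b></p>";
        System.out.println(toPlainText(html));
    }
}
```

If the goal is to drop the link text as well, cleaning alone is not enough; the elements have to be removed from the document.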

Or, you can simply remove all a tags:

doc.select("a").remove();
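
Putting this together with the code from the question, a minimal sketch might look like the following. The class name LinkFreeText, the helper textWithout, and the sample HTML are assumptions for illustration, and the selector list (a, nav) is only a guess at what counts as fluff on a given page. It parses an in-memory string so it runs without a network connection; with a live page you would start from Jsoup.connect(site).get() instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LinkFreeText {
    // Hypothetical helper: removes every element matching cssQuery
    // (together with its children), then returns the remaining body text.
    public static String textWithout(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.select(cssQuery).remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample input, invented for this sketch
        String html = "<html><body>"
                + "<nav><a href=\"/home\">Home</a></nav>"
                + "<p>Real content <a href=\"/more\">read more</a></p>"
                + "</body></html>";
        // A comma-separated selector matches any of the listed tags
        System.out.println(textWithout(html, "a, nav"));
    }
}
```

Because remove() deletes the matched elements and everything inside them, the link text disappears from the output of text(). Ads usually need a site-specific selector, such as a class or id found by inspecting the page.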
