Omit links, ads, etc. from jsoup parse

Question

I am using jsoup to scrape various HTML pages:

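import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlParse {
    public static void main(String[] args) throws IOException {
        String site = args[0];                     // URL of the page to fetch
        Document doc = Jsoup.connect(site).get();  // download and parse the page
        String htm = doc.body().text();            // all text inside <body>, link text included
        System.out.println(htm);
    }
}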

It works beautifully. However, the output contains a lot of fluff, e.g. the text of website links (a href). Is there a quick way to omit this in jsoup? I found the getElementsByTag documentation but am having a hard time using it.

Thank you in advance.

Answer 1

You can "clean" the parsed Document. For example, to keep only text and simple formatting tags (b, em, i, strong, u):

Whitelist whitelist = Whitelist.simpleText();
String result = Jsoup.clean(doc.html(), whitelist);
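
One thing worth checking before relying on the cleaner: as far as I can tell, Jsoup.clean drops the disallowed tags but keeps the text inside them, so the link text itself would still show up in the result. Here is a minimal sketch of that behavior. It assumes a recent jsoup release, where the class is named Safelist (older releases call it Whitelist, as in the snippet above). The class name CleanSketch, the helper method, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanSketch {
    // Removes every tag that the simple-text list does not allow.
    // The text content of the removed tags is kept.
    static String cleanToSimpleText(String html) {
        return Jsoup.clean(html, Safelist.simpleText());
    }

    public static void main(String[] args) {
        // Hypothetical input: one allowed tag (b) and one disallowed tag (a).
        String html = "<p>Read <b>this</b> or <a href=\"http://example.com\">visit us</a></p>";
        System.out.println(cleanToSimpleText(html));
    }
}
```

If that is right, the words "visit us" survive without the surrounding a tag, so the cleaner on its own probably does not get rid of link text.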

Or you can simply delete all a tags:

doc.select("a").remove();
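
To show how the removal fits into the original program, here is a small sketch that parses an HTML string, removes the a elements (and, optionally, anything matching an extra CSS selector), and then extracts the text. The class name StripLinksSketch, the helper method, the sample HTML, and the selector "div.ad" are all hypothetical. Real pages mark up their ads differently, so the selector has to be adapted per site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StripLinksSketch {
    // Returns the body text with all <a> elements removed, together with their text.
    // extraSelector is an optional CSS selector for other unwanted elements.
    static String textWithoutLinks(String html, String extraSelector) {
        Document doc = Jsoup.parse(html);
        doc.select("a").remove();                 // drop links and their contents
        if (extraSelector != null && !extraSelector.isEmpty()) {
            doc.select(extraSelector).remove();   // drop e.g. ad containers
        }
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Hypothetical page fragment with an ad block and a link.
        String html = "<div class=\"ad\">Buy now</div>"
                + "<p>Hello <a href=\"#\">click here</a> world</p>";
        System.out.println(textWithoutLinks(html, "div.ad"));
    }
}
```

In the program from the question, the same two remove calls would go between Jsoup.connect(site).get() and doc.body().text().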

solution1
6 ACCEPTED 2012-04-18 14:16:26
