using jsoup to parse html but not follow/fetch links

Question

What is the "correct" way to use JSoup to parse an HTML string or stream without fetching external data for link/img/area/iframe (and any other) tags? Right now, after fetching a page with Apache HttpComponents, I do something like this:

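HttpEntity entity = response.getEntity();
InputStream is = entity.getContent();
// charsetName = null lets the parser detect the encoding; baseUri = "" leaves relative URLs unresolved
Document doc = Jsoup.parse(is, null, "");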

This actually works fine. But passing an empty string as the baseUri just feels wrong, because I am betting JSoup tries to use it, only to fail and move on. I only want to use JSoup as an HTML parser and DOM manipulation kit, not as an HTTP framework. I am also a bit worried that JSoup might try to look for ="/foo" resources in the current directory or something. What does it do with an empty string? I tried passing null as the baseUri, which would be a natural interface for doing what I want, but it dies with an IllegalStateException.

Is there a way to do this, or am I worried about nothing?

Answer 1

... I don't think JSoup does that. The URL parameter is only used to canonicalize relative URLs; what you do with them is your responsibility. JSoup will not try to access resources by itself.
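
To illustrate the point about canonicalization, here is a minimal sketch that parses a small HTML string with and without a base URI and compares the raw attribute value with the resolved one. The class name BaseUriDemo, the sample HTML, and the example.com base URI are made up for illustration; the sketch assumes the jsoup jar is on the classpath. No network access is involved in any of these calls, since parsing only builds the DOM.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BaseUriDemo {
    // Hypothetical sample markup containing a relative link.
    static final String HTML = "<a href=\"/foo\">link</a>";

    // Returns the href exactly as written in the markup.
    static String rawHref(String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        return doc.select("a").first().attr("href");
    }

    // Returns the href resolved against the base URI,
    // or an empty string if it cannot be made absolute.
    static String absHref(String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        // Empty base URI: the attribute is untouched and cannot be resolved.
        System.out.println(rawHref(""));                          // /foo
        System.out.println(absHref("").isEmpty());                // true
        // With a base URI, the relative link is resolved as a string operation only.
        System.out.println(absHref("http://example.com/dir/"));   // http://example.com/foo
    }
}
```

With an empty base URI the relative link simply stays relative, so an empty string looks like a reasonable way to say "no base" when you only need parsing and DOM manipulation.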

solution1
1 ACCEPTED 2013-09-15 05:58:23