Using jsoup to parse HTML without following or fetching links

What is the "correct" way to use JSoup to parse an HTML string or stream without fetching external data for link/img/area/iframe (and any other) tags? Right now I am doing something like this after I fetch a page using Apache HttpComponents:

HttpEntity entity = response.getEntity();
InputStream is = entity.getContent();
// charsetName null: let jsoup detect the encoding; baseUri "": no base for relative URLs
Document doc = Jsoup.parse(is, null, "");

This actually works fine. But passing an empty baseUri just feels wrong, because I am betting JSoup tries to use it, only to fail and move on. I only want to use JSoup as an HTML parser and DOM manipulation kit, not as an HTTP framework. I am also a bit worried that JSoup might try to look for ="/foo" resources in the current directory or something. What does it do with an empty string? I tried passing null as the baseUri, which would be a natural interface for what I want, but it dies with an IllegalStateException.

Is there a way to do this, or am I worried about nothing?

... I don't think that JSoup does that. The URL parameter is only used to canonicalize relative URLs; what you do with the resulting URLs is your responsibility. JSoup will not try to access resources by itself.
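
To make "canonicalization of relative URLs" concrete, here is a minimal sketch of that kind of resolution using only java.net.URI. It is not JSoup's own code: the class BaseUriDemo and the helper resolve are hypothetical names made up for illustration, and the example.com URLs are placeholders. The point is that turning "/foo" into an absolute URL is pure string arithmetic against the base, and with an empty base there is simply nothing to resolve against.

```java
import java.net.URI;

public class BaseUriDemo {
    // Hypothetical helper (not part of jsoup): resolve href against baseUri.
    // This only computes a string; no connection is ever opened.
    public static String resolve(String baseUri, String href) {
        if (baseUri == null || baseUri.isEmpty()) {
            // No base to resolve against: an already absolute href is kept,
            // a relative one cannot be made absolute.
            return URI.create(href).isAbsolute() ? href : "";
        }
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/dir/page.html"; // placeholder base URI
        System.out.println(resolve(base, "/foo"));        // root-relative link
        System.out.println(resolve(base, "img/a.png"));   // path-relative link
        System.out.println(resolve("", "/foo").isEmpty()); // empty base, relative link
    }
}
```

In JSoup itself the corresponding lookup is element.absUrl("href"), which returns an empty string when the value cannot be made absolute; element.attr("href") always returns the raw attribute value unchanged.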
