
Jsoup.parse() vs. Jsoup.parse() - or How does URL detection work in Jsoup?

Jsoup has two HTML parse() methods:

  1. parse(String html) - "As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag."
  2. parse(String html, String baseUri) - "The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag."

I am having difficulty understanding the difference between the two:

  1. In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?
  2. What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
  3. Lastly, but most importantly: Is baseUri the full URL of HTML page (as phrased in original documentation) or is it the base URL of the HTML page?

Among other things, it is used by Element#absUrl() so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. For example:

for (Element link : document.select("a")) {
    System.out.println(link.absUrl("href"));
}

This is very useful if you want to download and/or parse the linked resources as well.


In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?

Some (poorly written) websites declare a <link> or <script> with a relative URL before the <base> tag, and those URLs are resolved against the given baseUri. If the page has no <base> tag at all, the given baseUri is used to resolve the relative URLs of the entire document.
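To make the fallback rule concrete, here is a minimal sketch that uses only java.net.URI. It is not Jsoup's own code, and Jsoup may differ in edge cases; the class name BaseFallbackSketch, the method resolveLink, and the example URLs are all made up for illustration. It prefers a declared <base href> and otherwise falls back to the URL the page was fetched from, which is the role baseUri plays.

```java
import java.net.URI;

public class BaseFallbackSketch {
    // Hypothetical helper, not part of Jsoup: pick the base to resolve against.
    // pageUrl  - the URL the HTML was retrieved from (what you would pass as baseUri)
    // baseHref - the value of <base href>, or null if the page declares none
    // link     - the (possibly relative) URL found in an attribute
    static String resolveLink(String pageUrl, String baseHref, String link) {
        URI page = URI.create(pageUrl);
        // A <base href> may itself be relative, so resolve it against the page URL first.
        URI base = (baseHref == null) ? page : page.resolve(baseHref);
        return base.resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html"; // assumed example URL
        // No <base> tag: the page URL alone decides the result.
        System.out.println(resolveLink(page, null, "style.css"));
        // With a <base href>: the declared base takes over.
        System.out.println(resolveLink(page, "http://cdn.example.org/assets/", "style.css"));
    }
}
```

Without a <base> tag the link resolves under http://example.com/docs/, and with one it resolves under http://cdn.example.org/assets/.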


What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?

So that Element#absUrl() can return the right URL. This is purely for the end user's convenience; Jsoup does not need it to parse the HTML successfully on its own.


Lastly, but most importantly: Is baseUri the full URL of HTML page (as phrased in original documentation) or is it the base URL of the HTML page?

The former. If it were the latter, the documentation would be lying. The baseUri must not be confused with <base href>.
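As a rough illustration of why the full page URL works as a base, the sketch below resolves a few links with java.net.URI. This is standard relative-reference resolution rather than Jsoup's implementation; the class PageUrlAsBase, the helper abs, and the URLs are assumptions made for the example. The last path segment of the base (the file name) is dropped during resolution, so you do not have to trim the URL yourself.

```java
import java.net.URI;

public class PageUrlAsBase {
    // Hypothetical helper: resolve href against a base URI string.
    static String abs(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Full page URL as base: "index.html" is replaced by the relative link.
        System.out.println(abs("http://example.com/docs/index.html", "page2.html"));
        // Root-relative link: only scheme and host are kept from the base.
        System.out.println(abs("http://example.com/docs/index.html", "/img/logo.png"));
        // A hand-trimmed base without a trailing slash loses the "docs" segment.
        System.out.println(abs("http://example.com/docs", "page2.html"));
    }
}
```

The third call shows the risk of trimming the URL by hand: without the trailing slash, "docs" is treated as the file name and the link resolves to http://example.com/page2.html.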
