
Jsoup.parse() vs. Jsoup.parse() - or How does URL detection work in Jsoup?

Jsoup has two HTML parse() methods:

  1. parse(String html) - "As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag."
  2. parse(String html, String baseUri) - "The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag."

I am having difficulty understanding the difference between the two:

  1. In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?
  2. What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
  3. Lastly, but most importantly: Is baseUri the full URL of the HTML page (as phrased in the original documentation) or is it the base URL of the HTML page?

It is used by, among others, Element#absUrl(), so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. E.g.:

for (Element link : document.select("a")) {
    System.out.println(link.absUrl("href"));
}

This is very useful if you want to download and/or parse the linked resources as well.
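As a minimal sketch of how the two parse() variants affect Element#absUrl(): the HTML string, the example.com URLs, and the class and method names below are made up for illustration. It parses the same markup once with a base URI and once without, and reads the link through absUrl("href").

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AbsUrlDemo {
    // Hypothetical page URL and content, used only for illustration.
    static final String PAGE_URL = "https://example.com/dir/index.html";
    static final String HTML =
        "<html><body>"
        + "<a id=\"rel\" href=\"page.html\">relative</a>"
        + "<a id=\"abs\" href=\"https://other.example/x\">absolute</a>"
        + "</body></html>";

    // parse(String html, String baseUri): relative links can be resolved.
    static String withBase(String id) {
        Document doc = Jsoup.parse(HTML, PAGE_URL);
        return doc.getElementById(id).absUrl("href");
    }

    // parse(String html): no base URI and no <base href> in the markup.
    static String withoutBase(String id) {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementById(id).absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(withBase("rel"));    // resolved against PAGE_URL
        System.out.println(withoutBase("rel")); // empty: nothing to resolve against
        System.out.println(withoutBase("abs")); // already absolute, returned as-is
    }
}
```

In both cases the document parses fine and attr("href") still returns the raw attribute value; only absUrl() depends on a base being known.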


In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?

Some (poor) websites may have declared a <link> or <script> with a relative URL before the <base> tag. Or, if there is no <base> tag at all, then just the given baseUri will be used for resolving relative URLs of the entire document.
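A small sketch of the two cases for a link that comes after the <head>: one page declares a <base href>, the other does not. The markup, the example.com and cdn.example.com URLs, and the class name are assumptions for illustration. How elements that precede the <base> tag are resolved may differ between Jsoup versions and is deliberately not shown here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BaseHrefDemo {
    // Hypothetical URL the HTML was retrieved from.
    static final String PAGE_URL = "https://example.com/dir/index.html";

    // Page that declares its own <base href> in the head.
    static final String WITH_BASE =
        "<html><head><base href=\"https://cdn.example.com/assets/\"></head>"
        + "<body><a href=\"page.html\">x</a></body></html>";

    // Same page without any <base> tag.
    static final String WITHOUT_BASE =
        "<html><head></head><body><a href=\"page.html\">x</a></body></html>";

    // Parses with PAGE_URL as baseUri and returns the absolute URL of the first link.
    static String firstLink(String html) {
        Document doc = Jsoup.parse(html, PAGE_URL);
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(firstLink(WITH_BASE));    // resolved against the <base href>
        System.out.println(firstLink(WITHOUT_BASE)); // resolved against PAGE_URL
    }
}
```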


What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?

In order to return the right URL from Element#absUrl(). This is purely for the end user's convenience. Jsoup doesn't need it in order to successfully parse the HTML on its own.


Lastly, but most importantly: Is baseUri the full URL of the HTML page (as phrased in the original documentation) or is it the base URL of the HTML page?

The former. If it were the latter, then the documentation would be lying. The baseUri must not be confused with <base href>.
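To see why passing the full page URL is the right thing to do, here is a sketch of standard relative-reference resolution using java.net.URI from the JDK. This is not Jsoup's own resolver; it is only assumed to behave the same way for simple cases like these. The URLs and the class name are hypothetical.

```java
import java.net.URI;

public class BaseUriResolution {
    // Resolves a relative reference against a base URL using standard URI rules.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Full URL of the page, file name included.
        String pageUrl = "https://example.com/dir/index.html";

        // The last path segment (index.html) is dropped before merging.
        System.out.println(resolve(pageUrl, "page.html"));
        System.out.println(resolve(pageUrl, "../img/logo.png"));
        System.out.println(resolve(pageUrl, "/top.html"));

        // A hand-trimmed "base" without a trailing slash loses the directory.
        System.out.println(resolve("https://example.com/dir", "page.html"));
    }
}
```

The last call shows the pitfall of trimming the URL yourself: without a trailing slash, "dir" is treated as the last segment and dropped, so the link resolves to the site root instead of the page's directory.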
