Resolving URIs with multiple slashes in the relative part

Question

I have to write a Perl script that parses URIs from HTML. The real problem is how to resolve relative URIs.

I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (let's follow RFC 3986), and various other URIs:

/g, //g, ///g, ////g, h//g, g////h, h///g:f

Section 5.4.1 of this RFC gives an example only for //g:

"//g" = "http://g"

What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?

"///g" = "http://a/b/c///g"

Or what should it be? Can anyone explain it better and back it up with a non-obsolete RFC or documentation?

Update #1: Try this working URL: https:///stackoverflow.com////////a/////10161264/////6618577

What's going on here?

Answer 1

I'll start by confirming that all the URIs you provided are valid, and by showing the outcome of the URI resolutions you mentioned (plus a couple of my own):

$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }

   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q   + /g      = http://a/g             host: a      path: /g
http://a/b/c/d;p?q   + //g     = http://g               host: g      path:
http://a/b/c/d;p?q   + ///g    = http:///g              host:        path: /g
http://a/b/c/d;p?q   + ////g   = http:////g             host:        path: //g
http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g      host: a      path: /b/c/h//g
http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h    host: a      path: /b/c/g////h
http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f   host: a      path: /b/c/h///g:f
http://host/a/b/c/d  + ../../e = http://host/a/e        host: host   path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e      host: host   path: /a/b/e

Next, we'll look at the syntax of relative URIs, since that's what your question revolves around.

relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar         ; 0 or more <pchar>
segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero

The key takeaways from these rules for answering your question:

An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
// can otherwise occur in a path because segments can have zero length.
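To make those two points concrete, here is a minimal Perl sketch that decides which relative-part alternative a reference falls under. The function name relative_part_kind and its short labels are my own, and it assumes the input has no scheme and does not validate individual characters, so treat it as an illustration of the grammar rather than a validator.

```perl
use strict;
use warnings;

# Classify the relative-part of a scheme-less reference (hypothetical helper).
# The query and fragment are stripped first, since relative-part ends before them.
sub relative_part_kind {
    my ($ref) = @_;
    (my $part = $ref) =~ s/[?#].*//s;
    return 'path-empty'    if $part eq '';
    return 'authority'     if $part =~ m{^//};   # "//" authority path-abempty
    return 'path-absolute' if $part =~ m{^/};    # single leading slash
    return 'path-noscheme';                      # anything else
}

for my $ref (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s %s\n", $ref, relative_part_kind($ref);
}
```

Anything that starts with two slashes is reported as the authority form, however many slashes follow, which is the crux of the ///g and ////g cases.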

Now, let's look at each of the resolutions you provided in turn.

/g is an absolute path (path-absolute), and thus a valid relative URI (relative-ref), and thus a valid URI (URI-reference).

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

    Base.scheme:    "http"        R.scheme:    undef
    Base.authority: "a"           R.authority: undef
    Base.path:      "/b/c/d;p"    R.path:      "/g"
    Base.query:     "q"           R.query:     undef
    Base.fragment:  undef         R.fragment:  undef
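If you want to reproduce that split yourself, the following sketch applies the Appendix B regular expression in Perl. The helper names parse_uri_ref and show are mine, not part of any module. The detail to watch is the difference between a component that is absent (undef) and one that is present but empty (""), because the resolution algorithm treats the two differently.

```perl
use strict;
use warnings;

# Split a URI reference with the regular expression from RFC 3986, Appendix B.
# Groups that take no part in the match come back as undef.
sub parse_uri_ref {
    my ($ref) = @_;
    my @m = $ref =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}s;
    return (scheme => $m[1], authority => $m[3], path => $m[4],
            query  => $m[6], fragment  => $m[8]);
}

# Render undef and "" differently so the distinction stays visible.
sub show { my ($v) = @_; return defined $v ? qq{"$v"} : 'undef' }

for my $ref ('http://a/b/c/d;p?q', '/g', '//g', '///g') {
    my %c = parse_uri_ref($ref);
    printf "%-20s %s\n", $ref,
        join '  ', map { "$_=" . show($c{$_}) } qw(scheme authority path query fragment);
}
```

For /g the authority is undef, while for ///g it is defined but empty. That difference is what sends the two references down different branches of the algorithm.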

Following the algorithm in §5.2.2, we get:

    T.path:      "/g"      ; remove_dot_segments(R.path)
    T.query:     undef     ; R.query
    T.authority: "a"       ; Base.authority
    T.scheme:    "http"    ; Base.scheme
    T.fragment:  undef     ; R.fragment

Following the algorithm in §5.3, we get:
```
 http://a/g 
```

//g is different. //g isn't an absolute path (path-absolute) because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).

Instead, it follows this pattern:

"//" authority path-abempty

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
```
Base.scheme:    "http"        R.scheme:    undef
Base.authority: "a"           R.authority: "g"
Base.path:      "/b/c/d;p"    R.path:      ""
Base.query:     "q"           R.query:     undef
Base.fragment:  undef         R.fragment:  undef
```

Following the algorithm in §5.2.2, we get the following:

    T.authority: "g"       ; R.authority
    T.path:      ""        ; remove_dot_segments(R.path)
    T.query:     undef     ; R.query
    T.scheme:    "http"    ; Base.scheme
    T.fragment:  undef     ; R.fragment

Following the algorithm in §5.3, we get the following:
```
 http://g 
```

Note: This contacts server g!

///g is similar to //g, except the authority is blank! Surprisingly, this is valid.

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
```
Base.scheme:    "http"        R.scheme:    undef
Base.authority: "a"           R.authority: ""
Base.path:      "/b/c/d;p"    R.path:      "/g"
Base.query:     "q"           R.query:     undef
Base.fragment:  undef         R.fragment:  undef
```

Following the algorithm in §5.2.2, we get the following:

    T.authority: ""        ; R.authority
    T.path:      "/g"      ; remove_dot_segments(R.path)
    T.query:     undef     ; R.query
    T.scheme:    "http"    ; Base.scheme
    T.fragment:  undef     ; R.fragment

Following the algorithm in §5.3, we get the following:
```
 http:///g 
```

Note: While valid, this URI is useless because the server name (T.authority) is blank!

////g is the same as ///g except that R.path is //g, so we get

    http:////g

Note: While valid, this URI is useless because the server name (T.authority) is blank!

The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following (shown for h//g):

    Base.scheme:    "http"        R.scheme:    undef
    Base.authority: "a"           R.authority: undef
    Base.path:      "/b/c/d;p"    R.path:      "h//g"
    Base.query:     "q"           R.query:     undef
    Base.fragment:  undef         R.fragment:  undef

Following the algorithm in §5.2.2, we get the following:

    T.path:      "/b/c/h//g"    ; remove_dot_segments(merge(Base.path, R.path))
    T.query:     undef          ; R.query
    T.authority: "a"            ; Base.authority
    T.scheme:    "http"         ; Base.scheme
    T.fragment:  undef          ; R.fragment

Following the algorithm in §5.3, we get the following:
```
http://a/b/c/h//g      # For h//g
http://a/b/c/g////h    # For g////h
http://a/b/c/h///g:f   # For h///g:f
```
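For anyone who wants to check the walkthroughs above without installing the URI module, here is a compact sketch of the strict §5.2.2 transform plus the §5.3 recomposition in plain Perl. The sub names (parse_ref, merge_paths, resolve) are my own, and the code assumes well-formed input and a base that is an absolute URI. It is meant to mirror the RFC pseudocode, not to replace a tested library.

```perl
use strict;
use warnings;

# Appendix B split, returning (scheme, authority, path, query, fragment).
sub parse_ref {
    my @m = $_[0] =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}s;
    return @m[1, 3, 4, 6, 8];
}

# Section 5.2.4: drop "." and ".." segments; empty segments are left alone.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})         { }                          # rule A
        elsif ($in =~ s{^/\.(?:/|\z)}{/})   { }                          # rule B
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) { $out =~ s{/?[^/]*\z}{} }   # rule C
        elsif ($in eq '.' || $in eq '..')   { $in = '' }                 # rule D
        else  { $in =~ s{^(/?[^/]*)}{}; $out .= $1 }                     # rule E
    }
    return $out;
}

# Section 5.2.3: replace everything after the last "/" of the base path.
sub merge_paths {
    my ($base_auth, $base_path, $ref_path) = @_;
    return "/$ref_path" if defined $base_auth && $base_path eq '';
    (my $dir = $base_path) =~ s{[^/]*\z}{};
    return $dir . $ref_path;
}

# Section 5.2.2 (strict) followed by the Section 5.3 recomposition.
sub resolve {
    my ($base, $ref) = @_;
    my ($bs, $ba, $bp, $bq)      = parse_ref($base);
    my ($rs, $ra, $rp, $rq, $rf) = parse_ref($ref);
    my ($ts, $ta, $tp, $tq);

    if (defined $rs) {
        ($ts, $ta, $tp, $tq) = ($rs, $ra, remove_dot_segments($rp), $rq);
    }
    elsif (defined $ra) {    # also taken when the authority is empty
        ($ts, $ta, $tp, $tq) = ($bs, $ra, remove_dot_segments($rp), $rq);
    }
    elsif ($rp eq '') {
        ($ts, $ta, $tp, $tq) = ($bs, $ba, $bp, defined $rq ? $rq : $bq);
    }
    else {
        my $path = $rp =~ m{^/} ? $rp : merge_paths($ba, $bp, $rp);
        ($ts, $ta, $tp, $tq) = ($bs, $ba, remove_dot_segments($path), $rq);
    }

    my $out = '';
    $out .= "$ts:"  if defined $ts;
    $out .= "//$ta" if defined $ta;
    $out .= $tp;
    $out .= "?$tq"  if defined $tq;
    $out .= "#$rf"  if defined $rf;
    return $out;
}

for my $ref (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s => %s\n", $ref, resolve('http://a/b/c/d;p?q', $ref);
}
```

The branch that matters for ///g and ////g is the "defined $ra" test: an empty authority is still a defined authority, so the base authority is never consulted.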

I don't think these examples are suitable for answering what I think you really want to know, though.

Take a look at the following two URIs. They aren't equivalent.

http://host/a/b/c/d     # Path has 4 segments: "a", "b", "c", "d"

and

http://host/a/b/c//d    # Path has 5 segments: "a", "b", "c", "", "d"

Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if these were the base URI for ../../e, you'd get

http://host/a/b/c/d + ../../e = http://host/a/e

and

http://host/a/b/c//d + ../../e = http://host/a/b/e
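The reason the two results differ is that each .. removes exactly one segment, and the empty segment between the two slashes counts as one. A small sketch (path_segments is a made-up helper name) makes the segment counts visible:

```perl
use strict;
use warnings;

# Split an absolute path into segments. The -1 limit keeps trailing empty
# fields; the empty field in front of the leading "/" is discarded.
sub path_segments {
    my ($path) = @_;
    my @seg = split m{/}, $path, -1;
    shift @seg;
    return @seg;
}

for my $path ('/a/b/c/d', '/a/b/c//d') {
    my @seg = path_segments($path);
    printf "%-10s %d segments: %s\n", $path, scalar @seg,
        join ', ', map { qq{"$_"} } @seg;
}
```

With the second base, the merge step drops the last segment (d), and the two .. segments then remove the empty segment and c, leaving /a/b/ in front of e.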

Answer 2

I was curious what Mojo::URL would do, so I checked. There's a big caveat, because it doesn't claim to be strictly compliant:

Mojo::URL implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs.

Here's the program.

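use v5.10;       # enables say
use strict;
use warnings;
use Mojo::URL;

# References to resolve, plus the odd-looking URL from the question.
my @urls = qw(/g //g ///g ////g h//g g////h h///g:f
    https:///stackoverflow.com////////a/////10161264/////6618577
    );
my @parts = qw(scheme host port path query);
# One "name: value" line per part. There is no trailing newline, so the next
# separator is printed on the same line as "query:".
my $template = join "\n", map { "$_: %s" } @parts;

my $base_url = Mojo::URL->new( 'http://a/b/c/d;p?q' );

foreach my $u ( @urls ) {
    # Resolve each reference against the base URL.
    my $url = Mojo::URL->new( $u )->base( $base_url )->to_abs;

    no warnings qw(uninitialized);   # some parts may be undef
    say '-' x 40;
    printf "%s\n$template", $u, map { $url->$_() } @parts;
    }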

Here's the output:

----------------------------------------
/g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
//g
scheme: http
host: g
port:
path:
query: ----------------------------------------
///g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
////g
scheme: http
host: a
port:
path: //g
query: ----------------------------------------
h//g
scheme: http
host: a
port:
path: /b/c/h/g
query: ----------------------------------------
g////h
scheme: http
host: a
port:
path: /b/c/g/h
query: ----------------------------------------
h///g:f
scheme: http
host: a
port:
path: /b/c/h/g:f
query: ----------------------------------------
https:///stackoverflow.com////////a/////10161264/////6618577
scheme: https
host:
port:
path: /stackoverflow.com////////a/////10161264/////6618577
query:

Answer 3

No, ///g would seem more equivalent to /g. The "dot-segments" .. and . are what is used to navigate up and down the hierarchy in http URLs. See also the URI module for handling paths in URIs.

3 answers

Answer 1 (score 4, accepted): 2018-10-04 21:26:53

Answer 2 (score 1): 2018-10-04 16:12:57

Answer 3 (score -1): 2018-10-04 14:51:28
