Resolve URI with multiple slashes in relative part

I have to write a script in Perl that parses URIs from HTML. Anyway, the real problem is how to resolve relative URIs.

I have a base URI (base href in HTML), for example http://a/b/c/d;p?q (let's go through RFC 3986), and various other URIs:

/g, //g, ///g, ////g, h//g, g////h, h///g:f

In this RFC, section 5.4.1 (link above), there is only an example for //g:

"//g" = "http://g"

What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?

"///g" = "http://a/b/c///g"

Or what should it be? Can anyone explain it better and prove it with a non-obsolete RFC or documentation?

Update #1: Try looking at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577

What's going on here?

I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):

$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }

   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q   + /g      = http://a/g             host: a      path: /g
http://a/b/c/d;p?q   + //g     = http://g               host: g      path:
http://a/b/c/d;p?q   + ///g    = http:///g              host:        path: /g
http://a/b/c/d;p?q   + ////g   = http:////g             host:        path: //g
http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g      host: a      path: /b/c/h//g
http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h    host: a      path: /b/c/g////h
http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f   host: a      path: /b/c/h///g:f
http://host/a/b/c/d  + ../../e = http://host/a/e        host: host   path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e      host: host   path: /a/b/e

Next, we'll look at the syntax of relative URIs, since that's what your question revolves around.

relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar         ; 0 or more <pchar>
segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero

The key things from these rules for answering your question:

  • An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
  • // can otherwise occur in a path because segments can have zero length.
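
As a rough illustration of the two points above, here is a minimal Perl sketch that reports which relative-part alternative a reference falls under. The helper name relative_part_kind is hypothetical, and the sketch assumes its input is already a syntactically valid relative reference; it only looks at how the string starts.

```perl
use strict;
use warnings;

# Hypothetical helper: name the relative-part alternative (RFC 3986,
# section 4.2) that a relative reference matches. It assumes the input is
# already a valid relative-ref and only inspects its leading characters.
sub relative_part_kind {
    my ($ref) = @_;
    (my $part = $ref) =~ s/[?#].*//s;    # drop query and fragment, if any
    return 'authority + path-abempty' if $part =~ m{^//};
    return 'path-absolute'            if $part =~ m{^/};
    return 'path-empty'               if $part eq '';
    return 'path-noscheme';
}

printf "%-8s => %s\n", $_, relative_part_kind($_)
    for qw( /g //g ///g ////g h//g g////h h///g:f );
```

Anything that starts with // is reported as the authority form, even when the authority itself turns out to be empty, as in ///g.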

Now, let's look at each of the resolutions you provided in turn.

/g is an absolute path (path-absolute), and thus a valid relative URI (relative-ref), and thus a valid URI (URI-reference).

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: undef
     Base.path:      "/b/c/d;p"    R.path:      "/g"
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get:

     T.path:      "/g"      ; remove_dot_segments(R.path)
     T.query:     undef     ; R.query
     T.authority: "a"       ; Base.authority
     T.scheme:    "http"    ; Base.scheme
     T.fragment:  undef     ; R.fragment

  • Following the algorithm in §5.3, we get:

     http://a/g 
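
The parsing step used in these walk-throughs can be sketched in Perl with the regular expression from RFC 3986, Appendix B (written here with non-capturing groups so that only the five components are captured). The function name parse_ref is an assumption made for illustration. A component that is absent comes back as undef, while one that is present but empty comes back as "", which is the distinction that matters for ///g.

```perl
use strict;
use warnings;

# Split a URI reference into its five components using the regular
# expression from RFC 3986, Appendix B. Absent components are undef;
# present-but-empty components are "". (parse_ref is a hypothetical name.)
sub parse_ref {
    my ($ref) = @_;
    my %c;
    @c{qw( scheme authority path query fragment )} =
        ( $ref =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s );
    return \%c;
}

for my $ref (qw( http://a/b/c/d;p?q /g //g ///g ////g h//g )) {
    my $c = parse_ref($ref);
    printf "%-20s authority: %-6s path: %s\n", $ref,
        defined $c->{authority} ? qq{"$c->{authority}"} : 'undef',
        qq{"$c->{path}"};
}
```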

//g is different. //g isn't an absolute path (path-absolute) because an absolute path can't start with an empty segment ( "/" [ segment-nz *( "/" segment ) ] ).

Instead, it follows this pattern:

"//" authority path-abempty
  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: "g"
     Base.path:      "/b/c/d;p"    R.path:      ""
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get the following:

     T.authority: "g"       ; R.authority
     T.path:      ""        ; remove_dot_segments(R.path)
     T.query:     undef     ; R.query
     T.scheme:    "http"    ; Base.scheme
     T.fragment:  undef     ; R.fragment

  • Following the algorithm in §5.3, we get the following:

     http://g 

Note: This contacts server g!

///g is similar to //g, except the authority is blank! This is surprisingly valid.

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: ""
     Base.path:      "/b/c/d;p"    R.path:      "/g"
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get the following:

     T.authority: ""        ; R.authority
     T.path:      "/g"      ; remove_dot_segments(R.path)
     T.query:     undef     ; R.query
     T.scheme:    "http"    ; Base.scheme
     T.fragment:  undef     ; R.fragment

  • Following the algorithm in §5.3, we get the following:

     http:///g 

Note: While valid, this URI is useless because the server name (T.authority) is blank!

////g is the same as ///g except the R.path is //g, so we get

    http:////g

Note: While valid, this URI is useless because the server name (T.authority) is blank!


The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following (shown for h//g):

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: undef
     Base.path:      "/b/c/d;p"    R.path:      "h//g"
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get the following:

     T.path:      "/b/c/h//g"    ; remove_dot_segments(merge(Base.path, R.path))
     T.query:     undef          ; R.query
     T.authority: "a"            ; Base.authority
     T.scheme:    "http"         ; Base.scheme
     T.fragment:  undef          ; R.fragment

  • Following the algorithm in §5.3, we get the following:

     http://a/b/c/h//g       # For h//g
     http://a/b/c/g////h     # For g////h
     http://a/b/c/h///g:f    # For h///g:f
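
To tie the three steps together, here is a minimal pure-Perl sketch of the whole procedure: the Appendix B parse, the §5.2.2 transformation (with merge from §5.2.3 and remove_dot_segments from §5.2.4), and the §5.3 recomposition. The sub names parse_ref, merge_paths and resolve are hypothetical, and the sketch follows the strict form of the algorithm only; it is an illustration of the RFC's pseudocode, not a replacement for the URI module.

```perl
use strict;
use warnings;

# Appendix B: split a reference into components (undef means absent).
sub parse_ref {
    my ($ref) = @_;
    my %c;
    @c{qw( scheme authority path query fragment )} =
        ( $ref =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s );
    return \%c;
}

# Section 5.2.4: interpret "." and ".." segments. Empty segments are left alone.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})         { }                          # A: "../" or "./"
        elsif ($in =~ s{^/\.(?:/|\z)}{/})   { }                          # B: "/./" or "/."
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) { $out =~ s{/?[^/]*\z}{} }   # C: "/../" or "/.."
        elsif ($in eq '.' or $in eq '..')   { $in = '' }                 # D
        else  { $in =~ s{^(/?[^/]*)}{}; $out .= $1 }                     # E: move one segment
    }
    return $out;
}

# Section 5.2.3: merge a relative path with the base path.
sub merge_paths {
    my ($base, $rel_path) = @_;
    return "/$rel_path" if defined $base->{authority} && $base->{path} eq '';
    (my $dir = $base->{path}) =~ s{[^/]*\z}{};    # keep up to the last "/"
    return $dir . $rel_path;
}

# Sections 5.2.2 and 5.3: transform the reference, then recompose it.
sub resolve {
    my ($rel, $base_str) = @_;
    my ($r, $base) = (parse_ref($rel), parse_ref($base_str));
    my %t = (fragment => $r->{fragment});
    if (defined $r->{scheme} or defined $r->{authority}) {
        $t{scheme} = defined $r->{scheme} ? $r->{scheme} : $base->{scheme};
        @t{qw( authority query )} = @$r{qw( authority query )};
        $t{path} = remove_dot_segments($r->{path});
    }
    else {
        @t{qw( scheme authority )} = @$base{qw( scheme authority )};
        if ($r->{path} eq '') {
            $t{path}  = $base->{path};
            $t{query} = defined $r->{query} ? $r->{query} : $base->{query};
        }
        else {
            $t{path} = $r->{path} =~ m{^/}
                ? remove_dot_segments($r->{path})
                : remove_dot_segments(merge_paths($base, $r->{path}));
            $t{query} = $r->{query};
        }
    }
    my $out = '';
    $out .= "$t{scheme}:"     if defined $t{scheme};
    $out .= "//$t{authority}" if defined $t{authority};
    $out .= $t{path};
    $out .= "?$t{query}"      if defined $t{query};
    $out .= "#$t{fragment}"   if defined $t{fragment};
    return $out;
}

printf "%-8s => %s\n", $_, resolve($_, 'http://a/b/c/d;p?q')
    for qw( /g //g ///g ////g h//g g////h h///g:f );
```

Note that the recomposition step emits "//" whenever the authority is defined, even if it is empty, which is why ///g resolves to http:///g rather than to http://a/g.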

I don't think the examples are suitable for answering what I think you really want to know, though.

Take a look at the following two URIs. They aren't equivalent.

http://host/a/b/c/d     # Path has 4 segments: "a", "b", "c", "d"

and

http://host/a/b/c//d    # Path has 5 segments: "a", "b", "c", "", "d"

Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if these were the base URI for ../../e, you'd get

http://host/a/b/c/d + ../../e = http://host/a/e

and

http://host/a/b/c//d + ../../e = http://host/a/b/e
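
The segment counts above can be checked with a small Perl sketch. The helper name path_segments is hypothetical; it simply splits on "/" and keeps empty segments, which is what makes the two paths differ.

```perl
use strict;
use warnings;

# Hypothetical helper: split a path into its segments. The -1 limit keeps
# trailing empty fields, and the leading "" produced by the initial "/"
# is dropped so that only real segments remain.
sub path_segments {
    my ($path) = @_;
    my @seg = split m{/}, $path, -1;
    shift @seg if @seg && $seg[0] eq '';
    return @seg;
}

for my $path (qw( /a/b/c/d /a/b/c//d )) {
    my @seg = path_segments($path);
    printf "%-10s has %d segments: %s\n", $path, scalar @seg,
        join ', ', map { qq{"$_"} } @seg;
}
```

Because each ".." removes exactly one segment, the empty segment in /a/b/c//d absorbs one of the two ".." steps, and the result keeps /a/b.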

I was curious what Mojo::URL would do, so I checked. There's a big caveat because it doesn't claim to be strictly compliant:

Mojo::URL implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs.

Here's the program.

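use v5.10;        # enables say
use strict;
use warnings;
use Mojo::URL;

# Relative references from the question, plus the URL from its update
my @urls = qw(/g //g ///g ////g h//g g////h h///g:f
    https:///stackoverflow.com////////a/////10161264/////6618577
    );
my @parts = qw(scheme host port path query);
my $template = join "\n", map { "$_: %s" } @parts;

my $base_url = Mojo::URL->new( 'http://a/b/c/d;p?q' );

foreach my $u ( @urls ) {
    # Resolve each reference against the base URL
    my $url = Mojo::URL->new( $u )->base( $base_url )->to_abs;

    no warnings qw(uninitialized);
    say '-' x 40;
    # No trailing newline here, so the next separator follows "query: "
    printf "%s\n$template", $u, map { $url->$_() } @parts
    }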

Here's the output:

----------------------------------------
/g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
//g
scheme: http
host: g
port:
path:
query: ----------------------------------------
///g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
////g
scheme: http
host: a
port:
path: //g
query: ----------------------------------------
h//g
scheme: http
host: a
port:
path: /b/c/h/g
query: ----------------------------------------
g////h
scheme: http
host: a
port:
path: /b/c/g/h
query: ----------------------------------------
h///g:f
scheme: http
host: a
port:
path: /b/c/h/g:f
query: ----------------------------------------
https:///stackoverflow.com////////a/////10161264/////6618577
scheme: https
host:
port:
path: /stackoverflow.com////////a/////10161264/////6618577
query:

No, ///g would seem more equivalent to /g. The "dot-segments" .. and . are what is used to navigate up and down the hierarchy in http URLs. See also the URI module for handling paths in URIs.
