
Resolve URI with multiple slashes in relative part

I have to write a Perl script that parses URIs from HTML. The real problem is how to resolve relative URIs.

I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (let's follow RFC 3986), and several other URIs:

/g, //g, ///g, ////g, h//g, g////h, h///g:f

Section 5.4.1 of this RFC (linked above) only has an example for //g:

"//g" = "http://g"

What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?

"///g" = "http://a/b/c///g"

Or what should it be? Can anyone explain it better and back it up with a non-obsoleted RFC or documentation?

Update #1: Try looking at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577

What's going on here?

I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):

$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }

   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q   + /g      = http://a/g             host: a      path: /g
http://a/b/c/d;p?q   + //g     = http://g               host: g      path:
http://a/b/c/d;p?q   + ///g    = http:///g              host:        path: /g
http://a/b/c/d;p?q   + ////g   = http:////g             host:        path: //g
http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g      host: a      path: /b/c/h//g
http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h    host: a      path: /b/c/g////h
http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f   host: a      path: /b/c/h///g:f
http://host/a/b/c/d  + ../../e = http://host/a/e        host: host   path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e      host: host   path: /a/b/e

Next, we'll look at the syntax of relative URIs, since that's what your question circles around.

relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar         ; 0 or more <pchar>
segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                               ; 1 or more <pchar> other than ":"   nc = no colon

The key things from these rules for answering your question:

  • An absolute path ( path-absolute ) can't start with // . The first segment, if provided, must be non-zero in length. If the relative URI starts with // , what follows must be an authority .
  • // can otherwise occur in a path because segments can have zero-length.
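
The two rules above can be sketched as a small classifier. This is only an illustration: classify_relative_part is a name I made up, it assumes the query and fragment have already been removed, and it assumes the input is a syntactically valid relative reference (it does not validate characters or check for a colon in the first segment).

```perl
use strict;
use warnings;

# Hypothetical helper: decide which alternative of the relative-part rule
# a reference matches, looking only at its leading slashes.
# Assumes a valid relative reference with no "?query" or "#fragment".
sub classify_relative_part {
    my ($ref) = @_;
    return 'path-empty'               if $ref eq '';
    return 'authority + path-abempty' if $ref =~ m{^//};   # "//" always starts an authority
    return 'path-absolute'            if $ref =~ m{^/};    # single leading slash
    return 'path-noscheme';                                # no leading slash
}

for my $ref (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s => %s\n", $ref, classify_relative_part($ref);
}
```

Only the start of the reference matters here: ///g and ////g land in the same bucket as //g, while h//g stays a relative path despite its empty segment.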

Now, let's look at each of the resolutions you provided in turn.

/g is an absolute path ( path-absolute ), and thus a valid relative URI ( relative-ref ), and thus a valid URI ( URI-reference ).

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: undef
     Base.path:      "/b/c/d;p"    R.path:      "/g"
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get:

     T.path:      "/g"      ; remove_dot_segments(R.path)
     T.query:     undef     ; R.query
     T.authority: "a"       ; Base.authority
     T.scheme:    "http"    ; Base.scheme
     T.fragment:  undef     ; R.fragment

  • Following the algorithm in §5.3, we get:

     http://a/g 

//g is different. //g isn't an absolute path ( path-absolute ) because an absolute path can't start with an empty segment ( "/" [ segment-nz *( "/" segment ) ] ).

Instead, it follows this pattern:

"//" authority path-abempty

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: "g"
     Base.path:      "/b/c/d;p"    R.path:      ""
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get the following:

     T.authority: "g"       ; R.authority
     T.path:      ""        ; remove_dot_segments(R.path)
     T.query:     undef     ; R.query
     T.scheme:    "http"    ; Base.scheme
     T.fragment:  undef     ; R.fragment

  • Following the algorithm in §5.3, we get the following:

     http://g 

Note: This contacts server g!


///g is similar to //g , except the authority is blank! This is surprisingly valid.

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: ""
     Base.path:      "/b/c/d;p"    R.path:      "/g"
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get the following:

     T.authority: ""        ; R.authority
     T.path:      "/g"      ; remove_dot_segments(R.path)
     T.query:     undef     ; R.query
     T.scheme:    "http"    ; Base.scheme
     T.fragment:  undef     ; R.fragment

  • Following the algorithm in §5.3, we get the following:

     http:///g 

Note: While valid, this URI is useless because the server name ( T.authority ) is blank!


////g is the same as ///g except the R.path is //g , so we get

    http:////g

Note: While valid, this URI is useless because the server name ( T.authority ) is blank!
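
The difference between the four slash-prefixed references comes down to how they parse. Here is a minimal sketch using the regular expression from Appendix B; parse_uri_ref is a name I made up, and the only point is to show where the authority is undefined versus defined but empty.

```perl
use strict;
use warnings;

# Split a URI reference into its five components with the regular
# expression from RFC 3986, Appendix B. A component that is absent comes
# back as undef; one that is present but empty comes back as "".
sub parse_uri_ref {
    my ($ref) = @_;
    my ($scheme, $authority, $path, $query, $fragment) =
        $ref =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:\#(.*))?}s;
    return {
        scheme => $scheme, authority => $authority, path => $path,
        query  => $query,  fragment  => $fragment,
    };
}

for my $ref (qw( /g //g ///g ////g )) {
    my $p = parse_uri_ref($ref);
    printf "%-6s authority: %-6s path: %s\n", $ref,
        defined $p->{authority} ? qq{"$p->{authority}"} : 'undef',
        qq{"$p->{path}"};
}
```

The first "//" is consumed as the authority marker, so every slash after it belongs to the path.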


The final three ( h//g , g////h , h///g:f ) are all relative paths ( path-noscheme ). The walkthrough below uses h//g ; the other two are handled the same way.

  • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

     Base.scheme:    "http"        R.scheme:    undef
     Base.authority: "a"           R.authority: undef
     Base.path:      "/b/c/d;p"    R.path:      "h//g"
     Base.query:     "q"           R.query:     undef
     Base.fragment:  undef         R.fragment:  undef

  • Following the algorithm in §5.2.2, we get the following:

     T.path:      "/b/c/h//g"    ; remove_dot_segments(merge(Base.path, R.path))
     T.query:     undef          ; R.query
     T.authority: "a"            ; Base.authority
     T.scheme:    "http"         ; Base.scheme
     T.fragment:  undef          ; R.fragment

  • Following the algorithm in §5.3, we get the following:

     http://a/b/c/h//g       # For h//g
     http://a/b/c/g////h     # For g////h
     http://a/b/c/h///g:f    # For h///g:f
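
The three steps used in each walkthrough above (parse per Appendix B, transform per §5.2.2, recompose per §5.3) can be put together in a short pure-Perl sketch. This is my own illustration of the strict algorithm, not how the URI module is implemented; parse_ref, merge_paths and resolve are names I chose, while remove_dot_segments follows §5.2.4.

```perl
use strict;
use warnings;

# Appendix B: returns (scheme, authority, path, query, fragment);
# absent components are undef.
sub parse_ref {
    my ($ref) = @_;
    my @parts = $ref =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:\#(.*))?}s;
    return @parts;
}

# Section 5.2.4: interpret "." and ".." segments; empty segments are kept.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        next if $in =~ s{^\.\.?/}{};                       # A: leading "../" or "./"
        next if $in =~ s{^/\.(?:/|\z)}{/};                 # B: "/./" or trailing "/."
        if ($in =~ s{^/\.\.(?:/|\z)}{/}) {                 # C: "/../" or trailing "/.."
            $out =~ s{/?[^/]*\z}{};                        #    drop last output segment
            next;
        }
        if ($in =~ /^\.\.?\z/) { $in = ''; next }          # D: lone "." or ".."
        $in =~ s{^(/?[^/]*)}{};                            # E: move one segment across
        $out .= $1;
    }
    return $out;
}

# Section 5.2.3: replace everything after the last "/" of the base path.
sub merge_paths {
    my ($base_authority, $base_path, $ref_path) = @_;
    return "/$ref_path" if defined $base_authority && $base_path eq '';
    (my $dir = $base_path) =~ s{[^/]*\z}{};
    return $dir . $ref_path;
}

# Sections 5.2.2 and 5.3: transform the reference, then recompose it.
sub resolve {
    my ($ref, $base) = @_;
    my ($bs, $ba, $bp, $bq)      = parse_ref($base);
    my ($rs, $ra, $rp, $rq, $rf) = parse_ref($ref);
    my ($ts, $ta, $tp, $tq);

    if (defined $rs) {
        ($ts, $ta, $tp, $tq) = ($rs, $ra, remove_dot_segments($rp), $rq);
    }
    elsif (defined $ra) {
        ($ts, $ta, $tp, $tq) = ($bs, $ra, remove_dot_segments($rp), $rq);
    }
    elsif ($rp eq '') {
        ($ts, $ta, $tp, $tq) = ($bs, $ba, $bp, defined $rq ? $rq : $bq);
    }
    else {
        my $path = $rp =~ m{^/} ? $rp : merge_paths($ba, $bp, $rp);
        ($ts, $ta, $tp, $tq) = ($bs, $ba, remove_dot_segments($path), $rq);
    }

    my $out = '';
    $out .= "$ts:"  if defined $ts;
    $out .= "//$ta" if defined $ta;     # a defined-but-empty authority still emits "//"
    $out .= $tp;
    $out .= "?$tq"  if defined $tq;
    $out .= "#$rf"  if defined $rf;
    return $out;
}

for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s => %s\n", $rel, resolve($rel, 'http://a/b/c/d;p?q');
}
```

For the references in the question, this should give the same URIs as the URI module output at the top of this answer.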

I don't think the examples are suitable for answering what I think you really want to know, though.

Take a look at the following two URIs. They aren't equivalent.

http://host/a/b/c/d     # Path has 4 segments: "a", "b", "c", "d"

and

http://host/a/b/c//d    # Path has 5 segments: "a", "b", "c", "", "d"

Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if these were the base URI for ../../e , you'd get

http://host/a/b/c/d + ../../e = http://host/a/e

and

http://host/a/b/c//d + ../../e = http://host/a/b/e
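
To make the segment counting concrete, here is a simplified sketch. path_segments and apply_relative are hypothetical helpers of mine; apply_relative only handles references made of ".." steps and plain names, so it is not a full remove_dot_segments.

```perl
use strict;
use warnings;

# Split an absolute path into its segments. The -1 limit keeps trailing
# empty fields, so empty segments survive wherever they occur.
sub path_segments {
    my ($path) = @_;
    $path =~ s{^/}{};                  # drop the leading slash
    return split m{/}, $path, -1;
}

# Hypothetical helper: apply a reference such as "../../e" to a base path
# one segment at a time. Each ".." removes exactly one segment, and an
# empty segment counts as one.
sub apply_relative {
    my ($base_path, $rel) = @_;
    my @segs = path_segments($base_path);
    pop @segs;                         # the last base segment is replaced (merge step)
    for my $step (split m{/}, $rel) {
        if ($step eq '..') { pop @segs }
        else               { push @segs, $step }
    }
    return '/' . join('/', @segs);
}

for my $path (qw( /a/b/c/d /a/b/c//d )) {
    my @segs = path_segments($path);
    printf "%-10s has %d segments; with ../../e it becomes %s\n",
        $path, scalar @segs, apply_relative($path, '../../e');
}
```

With /a/b/c//d the first ".." is spent on the empty segment, which is why the result keeps the b.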

I was curious what Mojo::URL would do so I checked. There's a big caveat because it doesn't claim to be strictly compliant:

Mojo::URL implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs.

Here's the program.

use v5.10;          # for say
use strict;
use warnings;
use Mojo::URL;

my @urls = qw(/g //g ///g ////g h//g g////h h///g:f
    https:///stackoverflow.com////////a/////10161264/////6618577
    );
my @parts = qw(scheme host port path query);
my $template = join "\n", map { "$_: %s" } @parts;

my $base_url = Mojo::URL->new( 'http://a/b/c/d;p?q' );

foreach my $u ( @urls ) {
    # Resolve each reference against the base URL
    my $url = Mojo::URL->new( $u )->base( $base_url )->to_abs;

    no warnings qw(uninitialized);
    say '-' x 40;
    # The template has no trailing newline, so the next separator
    # is printed right after the query value
    printf "%s\n$template", $u, map { $url->$_() } @parts
    }

Here's the output:

----------------------------------------
/g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
//g
scheme: http
host: g
port:
path:
query: ----------------------------------------
///g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
////g
scheme: http
host: a
port:
path: //g
query: ----------------------------------------
h//g
scheme: http
host: a
port:
path: /b/c/h/g
query: ----------------------------------------
g////h
scheme: http
host: a
port:
path: /b/c/g/h
query: ----------------------------------------
h///g:f
scheme: http
host: a
port:
path: /b/c/h/g:f
query: ----------------------------------------
https:///stackoverflow.com////////a/////10161264/////6618577
scheme: https
host:
port:
path: /stackoverflow.com////////a/////10161264/////6618577
query:

No. ///g would seem closer to /g . The "dot-segments" .. and . are what is used to navigate up and down the hierarchy in http URLs. See also the URI module for handling paths in URIs.
