I have to write a script in Perl that parses URIs from HTML. The real problem, though, is how to resolve relative URIs.
I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (following RFC 3986), and various other URIs:
/g, //g, ///g, ////g, h//g, g////h, h///g:f
Section 5.4.1 of this RFC (link above) only gives an example for //g:
"//g" = "http://g"
What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?
"///g" = "http://a/b/c///g"
Or what should it be? Can anyone explain this better and back it up with a non-obsolete RFC or documentation?
Update #1: Take a look at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577
What's going on here?
I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):
$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s host: %-4s path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }
   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s host: %-4s path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q + /g = http://a/g host: a path: /g
http://a/b/c/d;p?q + //g = http://g host: g path:
http://a/b/c/d;p?q + ///g = http:///g host: path: /g
http://a/b/c/d;p?q + ////g = http:////g host: path: //g
http://a/b/c/d;p?q + h//g = http://a/b/c/h//g host: a path: /b/c/h//g
http://a/b/c/d;p?q + g////h = http://a/b/c/g////h host: a path: /b/c/g////h
http://a/b/c/d;p?q + h///g:f = http://a/b/c/h///g:f host: a path: /b/c/h///g:f
http://host/a/b/c/d + ../../e = http://host/a/e host: host path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e host: host path: /a/b/e
Next, we'll look at the syntax of relative URIs, since that's what your question circles around.
relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
segment       = *pchar     ; 0 or more <pchar>
segment-nz    = 1*pchar    ; 1 or more <pchar>   nz = non-zero
The key things from these rules for answering your question:
An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length.
If the relative URI starts with //, what follows must be an authority.
Otherwise, // can occur in a path because segments can have zero length.
Now, let's look at each of the resolutions you provided in turn.
/g is an absolute path (path-absolute), and thus a valid relative URI (relative-ref), and thus a valid URI (URI-reference).
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: undef
Base.path:      "/b/c/d;p"   R.path:      "/g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
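As a rough sketch of this parsing step, the Appendix B regular expression can be applied directly in Perl. The sub name parse_uri_ref is made up for this illustration, and the sketch does no validation beyond what the regular expression itself does. The point to notice is the difference between an absent authority (undef) and one that is present but empty (""), which is what separates /g from ///g.

```perl
use strict;
use warnings;

# Split a URI reference into its five components with the regular
# expression from RFC 3986, Appendix B. A component that is absent comes
# back as undef; one that is present but empty comes back as "".
# (parse_uri_ref is a name made up for this sketch.)
sub parse_uri_ref {
    my ($ref) = @_;
    my %c;
    @c{qw(scheme authority path query fragment)} = $ref =~
        m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return \%c;
}

for my $ref ('/g', '//g', '///g') {
    my $c = parse_uri_ref($ref);
    printf "%-5s authority: %-6s path: \"%s\"\n",
        $ref,
        defined $c->{authority} ? qq{"$c->{authority}"} : 'undef',
        $c->{path};
}
```

Here /g has no authority at all, //g has authority "g" and an empty path, and ///g has a defined but empty authority with path "/g".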
Following the algorithm in §5.2.2, we get:
T.path:      "/g"    ; remove_dot_segments(R.path)
T.query:     undef   ; R.query
T.authority: "a"     ; Base.authority
T.scheme:    "http"  ; Base.scheme
T.fragment:  undef   ; R.fragment
Following the algorithm in §5.3, we get:
http://a/g
//g is different. //g isn't an absolute path (path-absolute) because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).
Instead, it matches the following pattern:
"//" authority path-abempty
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: "g"
Base.path:      "/b/c/d;p"   R.path:      ""
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.authority: "g"     ; R.authority
T.path:      ""      ; remove_dot_segments(R.path)
T.query:     undef   ; R.query
T.scheme:    "http"  ; Base.scheme
T.fragment:  undef   ; R.fragment
Following the algorithm in §5.3, we get the following:
http://g
Note: This contacts server g!
///g is similar to //g, except the authority is blank! Surprisingly, this is valid.
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: ""
Base.path:      "/b/c/d;p"   R.path:      "/g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.authority: ""      ; R.authority
T.path:      "/g"    ; remove_dot_segments(R.path)
T.query:     undef   ; R.query
T.scheme:    "http"  ; Base.scheme
T.fragment:  undef   ; R.fragment
Following the algorithm in §5.3, we get the following:
http:///g
Note: While valid, this URI is useless because the server name (T.authority) is blank!
////g is the same as ///g except that R.path is //g, so we get
http:////g
Note: While valid, this URI is useless because the server name (T.authority) is blank!
The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: undef
Base.path:      "/b/c/d;p"   R.path:      "h//g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef
(Shown for h//g; the other two differ only in R.path.)
Following the algorithm in §5.2.2, we get the following:
T.path:      "/b/c/h//g"  ; remove_dot_segments(merge(Base.path, R.path))
T.query:     undef        ; R.query
T.authority: "a"          ; Base.authority
T.scheme:    "http"       ; Base.scheme
T.fragment:  undef        ; R.fragment
Following the algorithm in §5.3, we get the following:
http://a/b/c/h//g       # For h//g
http://a/b/c/g////h     # For g////h
http://a/b/c/h///g:f    # For h///g:f
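The walk-through above can be condensed into a small dependency-free sketch of RFC 3986 §5.2.2 (resolution), §5.2.3 (merge), §5.2.4 (remove_dot_segments) and §5.3 (recomposition). The sub names parse_ref, merge_paths and resolve are invented here, the code assumes the base is an absolute URI, and it does not validate its input. It is meant to illustrate the steps, not to replace the URI module.

```perl
use strict;
use warnings;

# Appendix B: absent components are undef, empty ones are "".
sub parse_ref {
    my ($ref) = @_;
    my %c;
    @c{qw(scheme authority path query fragment)} = $ref =~
        m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return \%c;
}

# Section 5.2.4, steps A to E. Empty segments are never collapsed.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})         { }                         # A: leading "../" or "./"
        elsif ($in =~ s{^/\.(?:/|\z)}{/})   { }                         # B: "/./" or trailing "/."
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) { $out =~ s{/?[^/]*\z}{} }  # C: drop last output segment
        elsif ($in =~ m{^\.\.?\z})          { $in = '' }                # D: lone "." or ".."
        else  { $in =~ s{^(/?[^/]*)}{}; $out .= $1 }                    # E: move one segment across
    }
    return $out;
}

# Section 5.2.3: replace everything after the last "/" of the base path.
sub merge_paths {
    my ($base, $rel_path) = @_;
    return "/$rel_path" if defined $base->{authority} && $base->{path} eq '';
    (my $dir = $base->{path}) =~ s{[^/]*\z}{};
    return $dir . $rel_path;
}

sub resolve {
    my ($base_uri, $ref) = @_;
    my ($base, $r) = (parse_ref($base_uri), parse_ref($ref));
    my %t;
    if (defined $r->{scheme}) {
        @t{qw(scheme authority query)} = @$r{qw(scheme authority query)};
        $t{path} = remove_dot_segments($r->{path});
    }
    else {
        if (defined $r->{authority}) {          # //g, ///g, ////g land here
            @t{qw(authority query)} = @$r{qw(authority query)};
            $t{path} = remove_dot_segments($r->{path});
        }
        else {
            if ($r->{path} eq '') {
                $t{path}  = $base->{path};
                $t{query} = defined $r->{query} ? $r->{query} : $base->{query};
            }
            else {
                my $path = $r->{path} =~ m{^/}
                    ? $r->{path}
                    : merge_paths($base, $r->{path});
                $t{path}  = remove_dot_segments($path);
                $t{query} = $r->{query};
            }
            $t{authority} = $base->{authority};
        }
        $t{scheme} = $base->{scheme};
    }
    $t{fragment} = $r->{fragment};

    # Section 5.3: "//" is emitted whenever the authority is defined,
    # even when it is the empty string.
    my $out = '';
    $out .= "$t{scheme}:"     if defined $t{scheme};
    $out .= "//$t{authority}" if defined $t{authority};
    $out .= $t{path};
    $out .= "?$t{query}"      if defined $t{query};
    $out .= "#$t{fragment}"   if defined $t{fragment};
    return $out;
}

for my $ref (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s => %s\n", $ref, resolve('http://a/b/c/d;p?q', $ref);
}
```

The decisive test is defined $r->{authority}, not whether the authority is non-empty: that is why ///g resolves to http:///g and not to something under host a.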
I don't think the examples are suitable for answering what I think you really want to know, though.
Take a look at the following two URIs. They aren't equivalent.
http://host/a/b/c/d # Path has 4 segments: "a", "b", "c", "d"
and
http://host/a/b/c//d # Path has 5 segments: "a", "b", "c", "", "d"
Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if these were the base URIs for ../../e, you'd get
http://host/a/b/c/d + ../../e = http://host/a/e
and
http://host/a/b/c//d + ../../e = http://host/a/b/e
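To make the segment counting concrete, here is a simplified sketch that treats the path as a stack of segments. The sub names path_segments and apply_relative are made up for this example, and apply_relative is a deliberate simplification of the RFC's merge and remove_dot_segments steps: it assumes an absolute base path and ignores edge cases such as a reference that ends in a dot-segment.

```perl
use strict;
use warnings;

# Split an absolute path into its segments, keeping empty ones. The -1
# limit stops split from discarding trailing empty fields.
sub path_segments {
    my ($path) = @_;
    my @segments = split m{/}, $path, -1;
    shift @segments if $path =~ m{^/};   # drop the empty field before the leading "/"
    return @segments;
}

# Apply a relative path to a base path: the reference replaces the last
# segment of the base, and each ".." removes exactly one segment, whether
# or not that segment is empty.
sub apply_relative {
    my ($base_path, $rel) = @_;
    my @stack = path_segments($base_path);
    pop @stack;
    for my $seg (split m{/}, $rel, -1) {
        if    ($seg eq '..') { pop @stack }
        elsif ($seg ne '.')  { push @stack, $seg }
    }
    return '/' . join('/', @stack);
}

for my $base_path ('/a/b/c/d', '/a/b/c//d') {
    my @segments = path_segments($base_path);
    printf "%-10s has %d segments; + ../../e = %s\n",
        $base_path, scalar @segments, apply_relative($base_path, '../../e');
}
```

With /a/b/c//d, the first .. only removes the empty segment between c and d, which is why the result keeps b.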
I was curious what Mojo::URL would do so I checked. There's a big caveat because it doesn't claim to be strictly compliant:
Mojo::URL implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs.
Here's the program.
use v5.10;        # enables say
use Mojo::URL;

my @urls = qw(/g //g ///g ////g h//g g////h h///g:f
   https:///stackoverflow.com////////a/////10161264/////6618577
);
my @parts = qw(scheme host port path query);
# One "name: value" line per URL component
my $template = join "\n", map { "$_: %s" } @parts;
my $base_url = Mojo::URL->new( 'http://a/b/c/d;p?q' );
foreach my $u ( @urls ) {
   # Resolve each reference against the base URL
   my $url = Mojo::URL->new( $u )->base( $base_url )->to_abs;
   no warnings qw(uninitialized);   # absent components such as port are undef
   say '-' x 40;
   printf "%s\n$template", $u, map { $url->$_() } @parts
}
Here's the output:
----------------------------------------
/g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
//g
scheme: http
host: g
port:
path:
query: ----------------------------------------
///g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
////g
scheme: http
host: a
port:
path: //g
query: ----------------------------------------
h//g
scheme: http
host: a
port:
path: /b/c/h/g
query: ----------------------------------------
g////h
scheme: http
host: a
port:
path: /b/c/g/h
query: ----------------------------------------
h///g:f
scheme: http
host: a
port:
path: /b/c/h/g:f
query: ----------------------------------------
https:///stackoverflow.com////////a/////10161264/////6618577
scheme: https
host:
port:
path: /stackoverflow.com////////a/////10161264/////6618577
query:
No, ///g would seem more equivalent to /g. The "dot-segments" .. and . are what is used to navigate up and down the hierarchy with http URLs. See also the URI module for handling paths in URIs.