I am attempting to validate URLs using PHP's filter_var()
function. Per http://php.net/manual/en/filter.filters.validate.php :
Validates value as URL (according to http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, eg ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
Regarding "Beware a valid URL may not specify the HTTP protocol": my tests below indicate that a protocol is required (the URL 'stackoverflow.com/' is NOT considered valid). How am I misinterpreting the documentation?
Also, how are URLs such as https://https://stackoverflow.com/ prevented from validating true?
PS. Any comments on my approach of sanitizing the protocol would be appreciated.
<?php
function filterURL($url) {
echo("URL '{$url}' is ".(filter_var($url, FILTER_VALIDATE_URL)?'':' NOT ').'considered valid.<br>');
}
function sanitizeURL($url) {
return (strtolower(substr($url,0,7))=='http://' || strtolower(substr($url,0,8))=='https://')?$url:'http://'.$url;
}
filterURL('http://stackoverflow.com/');
filterURL('https://stackoverflow.com/');
filterURL('//stackoverflow.com/');
filterURL('stackoverflow.com/');
filterURL(sanitizeURL('http://stackoverflow.com/'));
filterURL(sanitizeURL('https://stackoverflow.com/'));
filterURL(sanitizeURL('stackoverflow.com/'));
filterURL('https://https://stackoverflow.com/');
?>
OUTPUT:
URL 'http://stackoverflow.com/' is considered valid.
URL 'https://stackoverflow.com/' is considered valid.
URL '//stackoverflow.com/' is NOT considered valid.
URL 'stackoverflow.com/' is NOT considered valid.
URL 'http://stackoverflow.com/' is considered valid.
URL 'https://stackoverflow.com/' is considered valid.
URL 'http://stackoverflow.com/' is considered valid.
URL 'https://https://stackoverflow.com/' is considered valid.
FILTER_VALIDATE_URL uses parse_url(), which unfortunately parses 'https://https://stackoverflow.com/' as a valid URL (and it really is valid under the URI RFC, with "https" read as the host name):
var_dump(parse_url('https://https://stackoverflow.com/'));
array(3) {
["scheme"]=> string(5) "https"
["host"]=> string(5) "https"
["path"]=> string(20) "//stackoverflow.com/"
}
You could change your sanitizeURL function to:
function sanitizeURL($url) {
return (parse_url($url, PHP_URL_SCHEME)) ? $url : 'http://' . $url;
}
but you still have to check that the host name is neither http nor https:
function filterURL($url) {
echo("URL '{$url}' is ".((filter_var($url, FILTER_VALIDATE_URL) !== false && (parse_url($url, PHP_URL_HOST) !== 'http' && parse_url($url, PHP_URL_HOST) !== 'https'))?'':' NOT ').'considered valid.<br>');
}
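The two steps above can be combined into a minimal sketch. The helper names addDefaultScheme() and isHttpUrl() are hypothetical, not part of PHP, and the sketch assumes that only http and https are acceptable schemes. It also strips leading slashes before prepending the scheme, so that protocol-relative input such as '//stackoverflow.com/' does not end up as 'http:////stackoverflow.com/'.

```php
<?php
// Hypothetical helper: prepend a default scheme only when none is present.
function addDefaultScheme(string $url, string $default = 'http'): string
{
    if (parse_url($url, PHP_URL_SCHEME)) {
        return $url;
    }
    // ltrim() also covers protocol-relative input such as '//stackoverflow.com/'.
    return $default . '://' . ltrim($url, '/');
}

// Hypothetical helper: accept only http/https URLs whose host is not
// itself a scheme name (which rejects 'https://https://...').
function isHttpUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host   = strtolower((string) parse_url($url, PHP_URL_HOST));

    return in_array($scheme, ['http', 'https'], true)
        && !in_array($host, ['http', 'https'], true);
}
```

Calling isHttpUrl(addDefaultScheme($input)) then treats 'stackoverflow.com/' and '//stackoverflow.com/' as 'http://stackoverflow.com/', while 'https://https://stackoverflow.com/' is rejected because of the host check.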
You can add or remove the http:// prefix after checking whether it is present. To sanitize and then validate a URL:
<?php
$url = "http://www.nigeriatest.com";
// Remove all illegal characters from a url
$url = filter_var($url, FILTER_SANITIZE_URL);
// Validate url
if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
echo("$url is a valid URL");
} else {
echo("$url is not a valid URL");
}
?>
How am I misinterpreting the documentation?
The specification doesn't say anything about not having a protocol - it simply states that the protocol might not be HTTP.
You chopped off the important part of the sentence in your quote:
Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol
A protocol is expected; it may or may not be HTTP.
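As a rough illustration of that reading, the sketch below reports which scheme a URL used once FILTER_VALIDATE_URL has accepted it. The function name validatedScheme() is hypothetical; only filter_var() and parse_url() come from PHP itself.

```php
<?php
// Hypothetical helper: return the lowercased scheme of a URL that passes
// FILTER_VALIDATE_URL, or null when the filter rejects the input.
function validatedScheme(string $url): ?string
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme) ? strtolower($scheme) : null;
}
```

An input such as 'ftp://example.com/' passes the filter with a non-HTTP scheme, whereas 'stackoverflow.com/' has no scheme at all and is rejected. The caller can compare the returned scheme against whatever protocols it expects.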