
Regular expression to match generic URL

I've looked all over and have yet to find a single solution that addresses my need for a regular expression pattern that will match a generic URL. I need to support multiple protocols (with verification), localhost and/or IP addressing, ports and query strings. Some examples:

Ideally, I'd like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.) but this is not a requirement.

(Also, for my own benefit and that of future readers, it would be helpful if you could explain the pattern.)

Appendix B of RFC 3986/STD 0066 (Uniform Resource Identifier (URI): Generic Syntax) provides the regular expression you need:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
  12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

 http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

 $1 = http:
 $2 = http
 $3 = //www.ics.uci.edu
 $4 = www.ics.uci.edu
 $5 = /pub/ietf/uri/
 $6 = <undefined>
 $7 = <undefined>
 $8 = #Related
 $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the five components as

 scheme    = $2
 authority = $4
 path      = $5
 query     = $7
 fragment  = $9

Going in the opposite direction, we can recreate a URI reference from its components by using the algorithm of Section 5.3.
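As a rough illustration of the appendix quoted above, here is a minimal Python sketch that applies the Appendix B expression unchanged and maps the numbered groups to the five components. The names `URI_RE`, `split_uri` and `recompose` are invented for this example and are not part of the RFC; `recompose` is my reading of the component recomposition steps in Section 5.3, so treat it as a sketch rather than a reference implementation.

```python
import re

# The Appendix B expression, copied as-is.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the five components; a component that is absent is None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

def recompose(parts):
    """Rebuild a URI reference from its components (after Section 5.3)."""
    result = ""
    if parts["scheme"] is not None:
        result += parts["scheme"] + ":"
    if parts["authority"] is not None:
        result += "//" + parts["authority"]
    result += parts["path"]
    if parts["query"] is not None:
        result += "?" + parts["query"]
    if parts["fragment"] is not None:
        result += "#" + parts["fragment"]
    return result

parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)
print(recompose(parts))
```

For the RFC's own example, the query comes back as `None` (the `<undefined>` of the listing above) and recomposing the parts gives back the original string.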

As for validating a URI against a particular scheme, you'll need to look at the RFC(s) describing the scheme(s) you are interested in to get the detail required to validate that a URI is valid for the scheme it purports to be. The URI scheme registry is located at http://www.iana.org/assignments/uri-schemes.html .

And even then, you're doomed to some sort of failure. Consider the file: scheme. You can't validate that it represents a valid path in the file system of the authority (unless you are the authority). The best that you can do is validate that it represents something that looks like a valid path. And even then, a Windows file: URL like file:///C:/foo/bar/baz/bat.txt is (would be) invalid for anything but a server running some flavor of Windows. Any server running *nix would likely choke on it (what's a drive letter anyway?).

Nicholas Carey is correct to steer you towards RFC-3986. The regex he points out will match a generic URI, but it will not validate it (and this regex is not good for picking URLs out of "the wild": it is too loose and matches just about any string, including an empty string).
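To make the "too loose" point concrete, here is a small Python sketch (the helper name `appendix_b_matches` is made up for this example). Because every group in the Appendix B expression is optional, the match succeeds on the empty string and on plain prose, which simply lands in the path group:

```python
import re

# The RFC 3986 Appendix B expression, copied as-is.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def appendix_b_matches(text):
    """True if the Appendix B expression matches at the start of text."""
    return URI_RE.match(text) is not None

for sample in ["", "not a url at all", "http://example.com/"]:
    m = URI_RE.match(sample)
    # Show whether it matched and what ended up in the path group ($5).
    print(repr(sample), appendix_b_matches(sample), repr(m.group(5)))
```

All three samples match, so the expression is useful for splitting something already known to be a URI, not for deciding whether a string is one.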

Regarding the validation requirement, you may want to take a look at an article I wrote on the subject, which takes from Appendix A all the ABNF syntax definitions of all the various components and provides regex equivalents:

Regular Expression URI Validation

Regarding the subject of picking out URLs from the "wild", take a look at Jeff Atwood's "The Problem With URLs" and John Gruber's "An Improved Liberal, Accurate Regex Pattern for Matching URLs" blog posts to get a glimpse of some of the subtle problems which can arise. Also, you may want to take a look at a project I started last year, URL Linkification, which picks out unlinked HTTP and FTP URLs from text which may already have some links.

That said, the following is a PHP function which uses a slightly modified version of the RFC-3986 "Absolute URI" regex to validate HTTP and FTP URLs (with this regex, the named host portion must not be empty). All the various components of the URI are isolated and captured into named groups, which allows for easy manipulation and validation of the parts within the program code:

function url_valid($url)
{
    if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
    if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
    if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
        /xD', $url, $m)) return FALSE; // No "m" flag: ^ and $ must anchor the whole string, not one line of it.
    switch ($m['scheme'])
    {
    case 'https':
    case 'http':
        if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
        break;
    case 'ftps':
    case 'ftp':
        break;
    default:
        return FALSE;   // Unrecognised URI scheme. Default to FALSE.
    }
    // Validate host name conforms to DNS "dot-separated-parts".
    if ($m['regname']) // If host regname specified, check for DNS conformance.
    {
        if (!preg_match('/# HTTP DNS host name.
            ^                      # Anchor to beginning of string.
            (?!.{256})             # Overall host length is less than 256 chars.
            (?:                    # Group dot separated host part alternatives.
              [0-9A-Za-z]\.        # Either a single alphanum followed by dot
            |                      # or... part has more than one char (63 chars max).
              [0-9A-Za-z]          # Part first char is alphanum (no dash).
              [\-0-9A-Za-z]{0,61}  # Internal chars are alphanum plus dash.
              [0-9A-Za-z]          # Part last char is alphanum (no dash).
              \.                   # Each part followed by literal dot.
            )*                     # Zero or more parts before top level domain.
            (?:                    # Explicitly specify top level domains.
              com|edu|gov|int|mil|net|org|biz|
              info|name|pro|aero|coop|museum|
              asia|cat|jobs|mobi|tel|travel|
              [A-Za-z]{2})         # Country codes are exactly two alpha chars.
            $                      # Anchor to end of string.
            /ix', $m['host'])) return FALSE;
    }
    $m['url'] = $url;
    for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
    return $m; // return TRUE == array of useful named $matches plus the valid $url.
}

The first regex validates the string as an absolute (has a non-empty host portion) generic URI. A second regex is used to validate the (named) host portion (when it is not an IP literal or IPv4 address) with regard to the DNS lookup system (where each dot-separated subdomain is 63 chars or less, consisting of digits, letters and dashes, with an overall length less than 256 chars).
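For readers who want the host-name rules without the surrounding URI regex, here is a minimal Python sketch of the same idea. It is not a port of the PHP above: the names `LABEL_RE` and `looks_like_dns_name` are invented for this example, the 255-character default mirrors the `(?!.{256})` lookahead as an assumption, and the explicit top-level-domain list is deliberately left out.

```python
import re

# One dot-separated label: 1 to 63 chars, alphanumeric at both ends,
# dashes allowed only in the interior.
LABEL_RE = re.compile(r"[0-9A-Za-z](?:[0-9A-Za-z-]{0,61}[0-9A-Za-z])?")

def looks_like_dns_name(host, max_length=255):
    """Check the shape of a host name only; nothing is looked up in DNS."""
    if not host or len(host) > max_length:
        return False
    # fullmatch() makes each label match in its entirety.
    return all(LABEL_RE.fullmatch(label) for label in host.split("."))

for host in ["www.example.com", "-bad.example.com", "a" * 64 + ".com"]:
    print(host[:20], looks_like_dns_name(host))
```

Splitting on dots and checking each label separately keeps the per-label length limit readable, at the cost of not capturing anything.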

Note that the structure of this function allows easy expansion to include other schemes.

Would this be in Perl by any chance?

Try:

use strict;
my $url = "http://localhost/test";
if ($url =~ m/^(.+):\/\/(.+)\/(.+)/) {
    my $protocol = $1;
    my $domain = $2;
    my $dir = $3;

    print "$protocol $domain $dir \n";
}
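One caveat about the pattern above: the `.+` groups are greedy, so it splits at the last slash and requires at least one path segment. The Python sketch below shows this (Python's `re` behaves the same way as Perl here). The `TIGHTER` pattern is only a suggestion of mine, borrowing the character classes from RFC 3986 Appendix B; it still does not validate anything.

```python
import re

# Same pattern as the Perl snippet above.
GREEDY = re.compile(r"^(.+)://(.+)/(.+)")

# Hypothetical tighter variant: stop the host at the first "/", "?" or "#"
# and make the path optional.
TIGHTER = re.compile(r"^([^:/?#]+)://([^/?#]*)(/[^?#]*)?")

url = "http://localhost:8080/a/b/test"
print(GREEDY.match(url).groups())    # the "domain" group swallows ":8080/a/b"
print(TIGHTER.match(url).groups())   # host and port stay together, path is whole

# A URL with no path at all is rejected by the greedy pattern.
print(GREEDY.match("http://localhost"))
print(TIGHTER.match("http://localhost").groups())
```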
