Regex for parsing URLs in PHP

I need to determine whether a given URL is valid. A URL should be accepted if its domain uses one of the following:

1. Generic top-level domains
2. Country code top-level domains

For the list, see http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

I need to do this in PHP. This is what I am currently doing:

    $regexUrl = "((https?|ftp)\:\/\/)?"; // SCHEME
    $regexUrl .= "([a-zA-Z0-9+!*(),;?&=\$_.-]+(\:[a-zA-Z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
    $regexUrl .= "([a-zA-Z0-9-]+)\.([a-zA-Z]{2,3})";  // Host or IP 
    $regexUrl .= "(\:[0-9]{2,5})?"; // Port 
    $regexUrl .= "(\/([a-zA-Z0-9+\$_-]\.?)+)*\/?"; // Path 
    $regexUrl .= "(\?[a-zA-Z+&\$_.-][a-zA-Z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
    $regexUrl .= "(#[a-zA-Z_.-][a-zA-Z0-9+\$_.-]*)?"; // Anchor 
    //if(preg_match_all("#\bhttps?://[^\s()]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#", $message, $matches1, PREG_PATTERN_ORDER))
    //$pattern = '/((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+)(\.)(com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-z]{2}))([\/][\/a-zA-Z0-9\.]*)*([\/]?(([\?][a-zA-Z0-9]+[\=][a-zA-Z0-9\%\(\)]*)([\&][a-zA-Z0-9]+[\=][a-zA-Z0-9\%\(\)]*)*))?/';
    if (preg_match_all("/$regexUrl/", $urlMessage, $matches1, PREG_PATTERN_ORDER))
    {
        try
        {
            foreach ($matches1[0] as $urlToTrim1)
            {
                $url = $urlToTrim1;
                echo $url;
            }
        }
        catch (Exception $e)
        {
            $url = "-1";
        }
    }

To figure out if it's generally a valid URL:

filter_var($url, FILTER_VALIDATE_URL)

http://www.php.net/manual/en/function.filter-var.php
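
A minimal sketch of how that check behaves. The helper name `isValidUrl` and the sample inputs are made up for illustration. `filter_var` returns the filtered URL on success and `false` on failure, so the result is compared strictly against `false`. Note that `FILTER_VALIDATE_URL` requires a scheme, so a bare `example.com` is rejected:

```php
<?php
// Hypothetical helper: wraps filter_var() so callers get a plain boolean.
function isValidUrl(string $url): bool
{
    // filter_var() returns the URL string on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(isValidUrl('http://example.com/path?x=1')); // bool(true)
var_dump(isValidUrl('example.com'));                 // bool(false), no scheme
var_dump(isValidUrl('not a url'));                   // bool(false)
```

If scheme-less input such as `www.example.com` has to be accepted, one option is to prepend `http://` before validating.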

If you want to confirm that the TLD is in your approved list (as far as I know, filter_var only checks the URL's syntax, not whether the TLD actually exists):

$host = parse_url($url, PHP_URL_HOST);
$tld = substr($host, strrpos($host, '.') + 1);

// check if $tld is in a list of allowed TLDs
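
The TLD check above could be completed along these lines. This is only a sketch: `ALLOWED_TLDS` and `hasAllowedTld` are hypothetical names, and the short list is a placeholder for whatever set of generic and country-code TLDs you decide to accept. It also guards against hosts with no dot at all, where `strrpos` returns `false`:

```php
<?php
// Placeholder allow-list (an assumption); in practice, fill it from the TLD list you trust.
const ALLOWED_TLDS = ['com', 'org', 'net', 'info', 'museum', 'uk', 'de', 'in'];

// Hypothetical helper: true if the URL's host ends in one of the allowed TLDs.
function hasAllowedTld(string $url, array $allowed = ALLOWED_TLDS): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no host found, e.g. missing scheme or malformed URL
    }

    $dot = strrpos($host, '.');
    if ($dot === false) {
        return false; // single-label host such as "localhost"
    }

    // Host names are case-insensitive, so normalise before comparing.
    $tld = strtolower(substr($host, $dot + 1));
    return in_array($tld, $allowed, true);
}

var_dump(hasAllowedTld('http://www.example.museum/a')); // bool(true)
var_dump(hasAllowedTld('http://example.xyz'));          // bool(false), not in the list
var_dump(hasAllowedTld('http://localhost/'));           // bool(false), no TLD
```

An allow-list like this also covers TLDs longer than three letters, which the `{2,3}` quantifier in the question's regex would not match.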

Or simply try to look up the DNS record of the domain using gethostbyname. If one exists, it's a valid domain.*


* Unless you're being DNS spoofed, if that case is important to you...
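
A rough sketch of the DNS approach, with `hostResolves` as a hypothetical helper name. `gethostbyname` returns the host name unchanged when the lookup fails, so the check compares the result with the input. It performs a real network lookup (IPv4 only), so it can be slow and depends on the resolver of the machine running it:

```php
<?php
// Hypothetical helper: true if the URL's host resolves to an IPv4 address.
function hostResolves(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // nothing to look up
    }

    // gethostbyname() returns its argument unchanged on failure.
    // Caveat: an IP-literal host also comes back unchanged, so this
    // sketch reports it as "not resolving".
    return gethostbyname($host) !== $host;
}

// Example call; the result depends on your network and DNS resolver.
// var_dump(hostResolves('http://www.php.net/'));
```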
