简体   繁体   中英

PHP URL Parsing & disecting

  • www.example.com
  • foo.example.com
  • foo.example.co.uk
  • foo.bar.example.com
  • foo.bar.example.co.uk

I've got these URL's here, and want to always end up with 2 variables:

$domainName = "example"
$domainNameSuffix = ".com" OR ".co.uk"

If I someone could get me from $url being one of the urls, all the way down to $newUrl being close to "example.co.uk", it would be a blessing.

Note that the urls are going to be completely "random", we might end up having "foo.bar.example2.com.au" too, so ... you know... ugh. (asking for the impossible?)

Cheers,

We had a few questions like this before, but I can't find a good one right now either. The crux is, this cannot be done reliably. You would need a long list of special TLDs (like .uk and .au) which have their own .com/.net level.

But as general approach and simple solution you could use:

preg_match('#([\w-]+)\.(\w+(\.(au|uk))?)\.?$#i', $domain, $m);
list(, $domain, $suffix) = $m;

You will need to maintain a list of extensions for most accurate results I believe.

$possibleExtensions = array(
    '.com',
    '.co.uk',
    '.com.au'
);

// parse_url() needs a protocol.
$str = 'http://' . $str;

// Use parse_url() to take into account any paths
// or fragments that may end up being there.
$host = parse_url($str, PHP_URL_HOST);

foreach($possibleExtensions as $ext) {

    if (preg_match('/' . preg_quote($ext, '/') . '\Z/', $host)) {
       $domainNameSuffix = $ext;
       // Strip extension     
       $domainName = substr($str, 0, -strlen($ext));
       // Strip off http://           
       $domainName = substr($domainName, 7);
       var_dump($domainName, $domainNameSuffix);
       break;

    }

}

If you never have any paths or extra stuff, you can of course skip the parse_url() and the http:// adding and removal.

It worked for all your tests .

The "domainNameSuffix" is called a top level domain (tld for short) , and there is no easy way to extract it.

Every country has it's own tld, and some countries have opted to further subdivide their tld. And since the number of subdomains (my.own.subdomain.example.com) is also variable, there is no easy "one-regexp-fits-all".

As mentioned, you need a list. Fortunately for you there are lists publicly available: http://publicsuffix.org/

There isn't a builtin function for this.

A quick google search lead me to http://www.wallpaperama.com/forums/php-function-remove-domain-name-get-tld-splitter-split-t5824.html

This leads me to believe you need to maintain a list of valid TLD's to split URLs on.

Alright chaps, here's how I solved it, for now. Implementation of more domain names will be done as well, at some point in the future. Don't know what technique I'll use, yet.

# Setting options, single and dual part domain extentions
$v2_onePart = array(
                "com"
                );
$v2_twoPart = array(
                "co.uk",
                "com.au"
                );

$v2_url         = $_SERVER['SERVER_NAME'];      # "example.com"     OR  "example.com.au"
$v2_bits        = explode(".", $v2_url);        # "example", "com"  OR  "example", "com", "au"
$v2_bits        = array_reverse($v2_bits);      # "com", "example"  OR  "au", "com", "example"      (Reversing to eliminate foo.bar.example.com.au problems.)

switch ($v2_bits) {
    case in_array($v2_bits[1] . "." . $v2_bits[0], $v2_twoPart):
        $v2_class   = $v2_bits[2] . " " . $v2_bits[1] . "_" . $v2_bits[0];  # "example com_au"
        break;
    case in_array($v2_bits[0], $v2_onePart):
        $v2_class   = $v2_bits[1] . " " . $v2_bits[0];  # "example com"
        break;
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM