
Make sure url is not repeated

I am building a website where people can submit their blog addresses. When someone submits a blog, I want to check whether it is already in the database.

The problem is that somebody can write the URL as "http://blog.com" or "http://www.blog.com".

What would be the best way for me to check if the url is repeated?

My idea is to check whether the URL contains "http://" and "www" and compare only the part after "www", but I feel this would be slow because I have more than 3000 URLs. Thanks!

www.blog.com and blog.com may or may not be two entirely different blogs. For example, example.blogspot.com and blogspot.com are two entirely different sites. www. is just a normal subdomain like any other and there's no rule on how it should behave. The same goes for the path following the domain; example.com/blorg and example.com/foobarg may be two independent blogs.

Therefore, you want to make an HTTP request to the given URL and see if it redirects somewhere. Typically there is one canonical URL, and www.blog.com redirects to blog.com or the other way around. So dig into the curl extension or any other favorite HTTP request module to make a request to the given URL and figure out which canonical URL it resolves to.
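As a rough sketch of that redirect check, assuming the curl extension is installed: the function name resolveCanonicalUrl, the timeout and the redirect limit are all made up for illustration. It sends a HEAD-style request, follows redirects, and reports the URL it ended up on.

```php
<?php
// Hypothetical helper (not part of any library): returns the final URL after
// following redirects, or null if the request fails.
function resolveCanonicalUrl(string $url, int $timeoutSeconds = 5): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // ask for headers only, skip the body
        CURLOPT_FOLLOWLOCATION => true,  // follow Location: redirects
        CURLOPT_MAXREDIRS      => 5,     // assumed limit, to avoid redirect loops
        CURLOPT_RETURNTRANSFER => true,  // do not echo the response
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ));

    // curl_exec() returns false when the request could not be completed.
    if (curl_exec($ch) === false) {
        return null;
    }

    // The URL of the last request made, i.e. where the redirects ended.
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return $finalUrl ?: null;
}
```

If both http://blog.com and http://www.blog.com come back with the same final URL, they can be treated as one blog. Some servers reject HEAD requests, so a real implementation may need to fall back to GET.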

You may also want to parse the URL using parse_url and take only, for instance, the hostname and path together as the unique identifier, ignoring other variations such as the scheme or query parameters.
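A minimal sketch of that host-plus-path key, assuming a missing scheme should default to http:// and that a trailing slash is not significant. The function name urlKey is hypothetical. It deliberately keeps www., since www.blog.com and blog.com may be different sites; the redirect check is what decides whether they are the same.

```php
<?php
// Hypothetical helper: builds a comparison key from the host and path only,
// ignoring the scheme, query string and fragment.
function urlKey(string $url): ?string
{
    $url = trim($url);

    // parse_url() only recognises the host when a scheme is present,
    // so prepend one if the submitter left it out.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null; // not something we can compare
    }

    // Host names are case-insensitive; paths are left as typed.
    $host = strtolower($parts['host']);
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';

    return $host . $path;
}
```

With this, http://blog.com, https://Blog.com/ and blog.com all map to the key blog.com, while example.com/blorg and example.com/foobarg stay distinct. Storing the key in an indexed column would let the database do the duplicate lookup, so a few thousand URLs should not be a performance concern.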

I would create a Url object that implements some comparison interface (C#).

Then you can do it like this:

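// Url is a class you write yourself; it has to overload == (and Equals)
// for this comparison to work.
var url = new Url("http://www.someblog.nl");
var url2 = new Url("http://someblog.nl");

if (url == url2)
{
    throw new UrlNeedsToBeUniqueException();
}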

You can implement the comparison with a regex, or simply strip a leading www. from the URL with a string replace before you compare.
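The answer above targets C#. As a rough PHP sketch of the same idea, since the rest of this thread uses PHP: PHP has no operator overloading, so a hypothetical equals() method stands in for ==. The class name and the normalisation rules (drop the scheme, drop a leading www., ignore a trailing slash) are assumptions for illustration.

```php
<?php
// Hypothetical value object; the class and its equals() method are illustrative.
final class Url
{
    private $normalized;

    public function __construct(string $raw)
    {
        // Drop the scheme, then a leading "www." only (not every "www." in the URL).
        $withoutScheme = preg_replace('~^[a-z][a-z0-9+.-]*://~i', '', trim($raw));
        $rest = preg_replace('~^www\.~i', '', $withoutScheme);

        // Lower-case the host, keep the path as typed, ignore a trailing slash.
        $pieces = explode('/', $rest, 2);
        $host = strtolower($pieces[0]);
        $path = isset($pieces[1]) ? rtrim('/' . $pieces[1], '/') : '';

        $this->normalized = $host . $path;
    }

    public function equals(Url $other): bool
    {
        return $this->normalized === $other->normalized;
    }
}

$url = new Url('http://www.someblog.nl');
$url2 = new Url('http://someblog.nl');

var_dump($url->equals($url2)); // bool(true)
```

Anchoring the pattern with ^ matters: a plain string replace of "www." would also alter a URL such as http://blog.com/www.archive.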

Disclaimer: this is experimental and is only meant to guide you toward the format you want to use.

I think you should save only the domain and subdomain. I'll demonstrate what I mean with this simple script.

Imagine an array

$urls = array('http://blog.com',
        'http://somethingelse.blog.com',
        'http://something1.blog.com',
        'ftp://blog.com',
        'https://blog.com',
        'http://www.blog.com',
        'http://www.blog.net',
        'blog.com',
        'somethingelse.blog.com');

If you run

$found = array();
$blogUrl = new BlogURL();
foreach ( $urls as $url ) {
    $domain = $blogUrl->parse($url);
    if (! $domain) {
        $blogUrl->log("#Parse can't parse  $url");
        continue;
    }

    $key = array_search($domain, $found);

    if ($key !== false) {
        $blogUrl->log("#Duplicate $url same as {$found[$key]}");
        continue;
    }

    $found[] = $domain;
    $blogUrl->log("#new $url has  $domain");
}

var_dump($found);

Output

array
  0 => string 'blog.com' (length=8)
  1 => string 'somethingelse.blog.com' (length=22)
  2 => string 'something1.blog.com' (length=19)
  3 => string 'blog.net' (length=8)

If you want to see the inner workings

echo "<pre>";
echo implode(PHP_EOL, $blogUrl->getOutput());

Output

blog.com Found in http://blog.com
#new http://blog.com has  blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#new http://somethingelse.blog.com has  somethingelse.blog.com
something1.blog.com Found in http://something1.blog.com
#new http://something1.blog.com has  something1.blog.com
#error domain not found in ftp://blog.com
#Parse can't parse  ftp://blog.com
blog.com Found in https://blog.com
#Duplicate https://blog.com same as blog.com
www.blog.com Found in http://www.blog.com
#Duplicate http://www.blog.com same as blog.com
www.blog.net Found in http://www.blog.net
#new http://www.blog.net has  blog.net
#Fixed blog.com to http://blog.com
blog.com Found in http://blog.com
#Duplicate blog.com same as blog.com
#Fixed somethingelse.blog.com to http://somethingelse.blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#Duplicate somethingelse.blog.com same as somethingelse.blog.com

Class Used

class BlogURL {
    private $output = array();

    function parse($url) {
        // Prepend a default scheme when the submitter left it out.
        if (! preg_match("~^(?:f|ht)tps?://~i", $url)) {
            $original = $url;
            $url = "http://" . $url;
            $this->log("#Fixed $original to $url");
        }

        if (! filter_var($url, FILTER_VALIDATE_URL)) {
            $this->log("#Error $url not valid");
            return false;
        }
        preg_match('!https?://(\S+)+!', $url, $matches);
        $domain = isset($matches[1]) ? $matches[1] : null;

        if (! $domain) {
            $this->log("#error domain not found in $url");
            return false;
        }
        $this->log($domain . " Found in $url");

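        // ltrim() takes a character list, so "w." would also eat the "w" in
        // "wordpress.com"; strip only a literal leading "www." instead.
        return preg_replace('/^www\./i', '', $domain);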
    }

    function log($var = PHP_EOL) {
        $this->output[] = $var;
    }

    function getOutput() {
        return $this->output;
    }
}
