
Make sure URL is not repeated

I am building a website where people can submit their blog addresses. What I'm trying to do is, when they submit a blog, check the database to see if it's already there.

The problem I have is that somebody can write the URL as "http://blog.com" or "http://www.blog.com".

What would be the best way for me to check if the URL is repeated?

What I think is I would check whether the URL has an "http://" and a "www", and compare the part after "www", but I feel this would be slow because I have more than 3,000 URLs. Thanks!

www.blog.com and blog.com may or may not be two entirely different blogs. For example, example.blogspot.com and blogspot.com are two entirely different sites. www. is just a normal subdomain like any other, and there's no rule on how it should behave. The same goes for the path following the domain; example.com/blorg and example.com/foobarg may be two independent blogs.

Therefore, you want to make an HTTP request to the given URL and see if it redirects somewhere. Typically there is one canonical URL, and www.blog.com redirects to blog.com or the other way around. So dig into the curl extension or any other favorite HTTP request module to make a request to the given URL and figure out which canonical URL it resolves to.
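
For instance, a minimal sketch of that idea with the curl extension; the resolveCanonicalUrl helper name is just illustrative, not from any existing code here:

// Follow redirects with curl and return the URL the request ends up at.
function resolveCanonicalUrl($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD-style request, skip the body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after all redirects
    curl_close($ch);
    return $effective;
}

// resolveCanonicalUrl('http://www.blog.com') and resolveCanonicalUrl('http://blog.com')
// would typically return the same canonical URL if one redirects to the other.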

You may also want to parse the entire URL using parse_url and only take, for instance, the hostname and path together as the unique identifier, ignoring other irregularities like the scheme or query parameters.
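
A rough sketch of that normalization; building the key from host plus path is only one possible choice, and normalizeUrlKey is an illustrative name:

// Build a comparison key from the host and path only, ignoring scheme and query string.
function normalizeUrlKey($url) {
    // parse_url only finds the host reliably when a scheme is present
    if (! preg_match('~^https?://~i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null; // not a usable URL
    }
    $host = strtolower($parts['host']);
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    return $host . $path;
}

// normalizeUrlKey('HTTP://Blog.com/') and normalizeUrlKey('http://blog.com?ref=x')
// both yield 'blog.com', so they would be treated as the same entry.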

I would create a Url object which implements some compare interface (C#).

So you can do it like this.

var url = new Url("http://www.someblog.nl");
var url2 = new Url("http://someblog.nl");

if (url == url2)
{
    throw new UrlNeedsToBeUniqueException();
}

You can implement the compare function with some regex, or just always strip the "www." part from the URL with a string replace before you start to compare.
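
The same stripping idea, sketched in PHP since that is the question's language; urlsAreSame is an illustrative helper, not an existing function:

// Compare two URLs after dropping the scheme, a leading "www.", and a trailing slash.
function urlsAreSame($a, $b) {
    $normalize = function ($u) {
        $u = strtolower(trim($u));
        $u = preg_replace('~^https?://~', '', $u);   // drop the scheme
        $u = preg_replace('~^www\.~', '', $u);       // drop a leading "www."
        return rtrim($u, '/');
    };
    return $normalize($a) === $normalize($b);
}

// urlsAreSame('http://www.someblog.nl', 'http://someblog.nl') returns true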

Disclaimer: this is for experimental purposes; it is supposed to guide you toward the best format you want to use.

I think you should save only the domain and subdomain. I will demonstrate what I mean with this simple script.

Imagine an array

$urls = array('http://blog.com',
        'http://somethingelse.blog.com',
        'http://something1.blog.com',
        'ftp://blog.com',
        'https://blog.com',
        'http://www.blog.com',
        'http://www.blog.net',
        'blog.com',
        'somethingelse.blog.com');

If you run

$found = array();
$blogUrl = new BlogURL();
foreach ( $urls as $url ) {
    $domain = $blogUrl->parse($url);
    if (! $domain) {
        $blogUrl->log("#Parse can't parse  $url");
        continue;
    }

    $key = array_search($domain, $found);

    if ($key !== false) {
        $blogUrl->log("#Duplicate $url same as {$found[$key]}");
        continue;
    }

    $found[] = $domain;
    $blogUrl->log("#new $url has  $domain");
}

var_dump($found);

Output

array
  0 => string 'blog.com' (length=8)
  1 => string 'somethingelse.blog.com' (length=22)
  2 => string 'something1.blog.com' (length=19)
  3 => string 'blog.net' (length=8)

If you want to see the inner workings

echo "<pre>";
echo implode(PHP_EOL, $blogUrl->getOutput());

Output

blog.com Found in http://blog.com
#new http://blog.com has  blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#new http://somethingelse.blog.com has  somethingelse.blog.com
something1.blog.com Found in http://something1.blog.com
#new http://something1.blog.com has  something1.blog.com
#error domain not found in ftp://blog.com
#Parse can't parse  ftp://blog.com
blog.com Found in https://blog.com
#Duplicate https://blog.com same as blog.com
www.blog.com Found in http://www.blog.com
#Duplicate http://www.blog.com same as blog.com
www.blog.net Found in http://www.blog.net
#new http://www.blog.net has  blog.net
#Fixed blog.com to http://blog.com
blog.com Found in http://blog.com
#Duplicate blog.com same as blog.com
#Fixed somethingelse.blog.com to http://somethingelse.blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#Duplicate somethingelse.blog.com same as somethingelse.blog.com

Class used

class BlogURL {
    private $output = [];

    function parse($url) {
        // Prepend a scheme when the URL does not start with http(s):// or ftp(s)://
        if (! preg_match("~^(?:f|ht)tps?://~i", $url)) {
            $fixed = "http://" . $url;
            $this->log("#Fixed $url to $fixed");
            $url = $fixed;
        }

        if (! filter_var($url, FILTER_VALIDATE_URL)) {
            $this->log("#Error $url not valid");
            return false;
        }

        // Capture only the host part (domain and subdomain), ignoring any path
        preg_match('!https?://([^/\s]+)!', $url, $matches);
        $domain = isset($matches[1]) ? $matches[1] : null;

        if (! $domain) {
            $this->log("#error domain not found in $url");
            return false;
        }
        $this->log($domain . " Found in $url");

        // Strip only a leading "www."; ltrim($domain, "w.") would also mangle
        // hosts such as "web.blog.com"
        return preg_replace('/^www\./i', '', $domain);
    }

    function log($var = PHP_EOL) {
        $this->output[] = $var;
    }

    function getOutput() {
        return $this->output;
    }
}
