Make sure a URL is not repeated
I am building a website where people can submit their blog addresses. When someone submits a blog, I want to check whether it is already in the database.
The problem is that somebody can write the URL as "http://blog.com" or "http://www.blog.com".
What would be the best way to check whether the URL is repeated?
My idea is to check whether the URL contains "http://" and "www" and then compare the part after "www", but I feel this would be slow because I have more than 3000 URLs. Thanks!
www.blog.com and blog.com may or may not be two entirely different blogs. For example, example.blogspot.com and blogspot.com are two entirely different sites. www. is just a normal subdomain like any other, and there's no rule on how it should behave. The same goes for the path following the domain; example.com/blorg and example.com/foobarg may be two independent blogs.
Therefore, you want to make an HTTP request to the given URL and see if it redirects somewhere. Typically there is one canonical URL, and www.blog.com redirects to blog.com or the other way around. So dig into the curl extension or any other favorite HTTP request module to make a request to the given URL and figure out which canonical URL it resolves to.
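A minimal sketch of that redirect check with PHP's curl extension follows. The function name resolveCanonicalUrl, the redirect limit, and the timeouts are assumptions made for illustration; it sends a HEAD request, follows redirects, and returns the URL curl finally ended up on.

```php
<?php
// Hypothetical helper (not part of the original answer): follow redirects
// and report the URL the request finally lands on, or null on failure.
function resolveCanonicalUrl(string $url, int $maxRedirects = 5): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,          // HEAD request, no body download
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // assumed limit, guards against loops
        CURLOPT_RETURNTRANSFER => true,          // do not echo the response
        CURLOPT_CONNECTTIMEOUT => 5,             // assumed timeouts, in seconds
        CURLOPT_TIMEOUT        => 10,
    ]);
    if (curl_exec($ch) === false) {
        return null; // DNS failure, timeout, too many redirects, ...
    }
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

Two submitted addresses can then be treated as the same blog when resolveCanonicalUrl() returns the same value for both. Some servers answer HEAD requests differently from GET, so this is a starting point rather than a guarantee.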
You may also want to parse the entire URL using parse_url and take only, for instance, the hostname and path together as the unique identifier, ignoring other irregularities like the scheme or query parameters.
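As a rough sketch of that idea, the hypothetical helper urlKey below builds a key from the hostname and path only. The name, the lowercasing of the host, and the trimming of a trailing slash are assumptions for illustration. The www. prefix is deliberately left alone, since it may point to a different site.

```php
<?php
// Hypothetical helper (not from the original answer): host + path as the
// identifier, ignoring scheme, query string and fragment.
function urlKey(string $url): ?string
{
    $url = trim($url);
    // parse_url() only reports a host when a scheme is present,
    // so assume http:// for bare input such as "blog.com".
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }
    $host = strtolower($parts['host']); // hostnames are case-insensitive
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    return $host . $path;
}

echo urlKey('https://Example.com/blorg/?ref=1'), PHP_EOL; // prints example.com/blorg
```

Storing this key next to the submitted URL lets the duplicate check become a plain equality comparison on one column.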
I would create a Url object which implements some comparison interface (C#), so you can do it like this:
var url = new Url("http://www.someblog.nl");
var url2 = new Url("http://someblog.nl");
// Assumes Url overloads == (and Equals); otherwise == compares references.
if (url == url2)
{
    throw new UrlNeedsToBeUniqueException();
}
You can implement the compare function with some regex, or just always strip the www. part from the URL with a string replace before you start to compare.
Disclaimer: this is for experimental purposes; it is supposed to guide you towards the format you want to use.
I think you should save only the domain and subdomain. I will demonstrate what I mean with this simple script.
Imagine an array
$urls = array('http://blog.com',
'http://somethingelse.blog.com',
'http://something1.blog.com',
'ftp://blog.com',
'https://blog.com',
'http://www.blog.com',
'http://www.blog.net',
'blog.com',
'somethingelse.blog.com');
If you run
$found = array();
$blogUrl = new BlogURL();
foreach ( $urls as $url ) {
    $domain = $blogUrl->parse($url);
    if (! $domain) {
        $blogUrl->log("#Parse can't parse $url");
        continue;
    }
    $key = array_search($domain, $found);
    if ($key !== false) {
        $blogUrl->log("#Duplicate $url same as {$found[$key]}");
        continue;
    }
    $found[] = $domain;
    $blogUrl->log("#new $url has $domain");
}
var_dump($found);
Output
array
0 => string 'blog.com' (length=8)
1 => string 'somethingelse.blog.com' (length=22)
2 => string 'something1.blog.com' (length=19)
3 => string 'blog.net' (length=8)
If you want to see the inner workings
echo "<pre>";
echo implode(PHP_EOL, $blogUrl->getOutput());
Output
blog.com Found in http://blog.com
#new http://blog.com has blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#new http://somethingelse.blog.com has somethingelse.blog.com
something1.blog.com Found in http://something1.blog.com
#new http://something1.blog.com has something1.blog.com
#error domain not found in ftp://blog.com
#Parse can't parse ftp://blog.com
blog.com Found in https://blog.com
#Duplicate https://blog.com same as blog.com
www.blog.com Found in http://www.blog.com
#Duplicate http://www.blog.com same as blog.com
www.blog.net Found in http://www.blog.net
#new http://www.blog.net has blog.net
#Fixed blog.com to
#Fixed http://blog.com to http://blog.com
blog.com Found in http://blog.com
#Duplicate blog.com same as blog.com
#Fixed somethingelse.blog.com to
#Fixed http://somethingelse.blog.com to http://somethingelse.blog.com
somethingelse.blog.com Found in http://somethingelse.blog.com
#Duplicate somethingelse.blog.com same as somethingelse.blog.com
Class used
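class BlogURL {
    private $output = array();

    // Returns the host without a leading "www.", or false if the URL is unusable.
    function parse($url) {
        // No scheme given: assume http:// so the URL can be validated.
        if (! preg_match("~^(?:f|ht)tps?://~i", $url)) {
            $this->log("#Fixed $url to ");
            $url = "http://" . $url;
            $this->log("#Fixed $url to $url");
        }
        if (! filter_var($url, FILTER_VALIDATE_URL)) {
            $this->log("#Error $url not valid");
            return false;
        }
        // Capture only the host of an http(s) URL; other schemes such as ftp are rejected.
        preg_match('!^https?://([^/?#\s]+)!i', $url, $matches);
        $domain = isset($matches[1]) ? $matches[1] : null;
        if (! $domain) {
            $this->log("#error domain not found in $url");
            return false;
        }
        $this->log($domain . " Found in $url");
        // Strip one leading "www." only. ltrim($domain, "w.") removes every
        // leading "w" and "." character, so "web.blog.com" would become "eb.blog.com".
        return preg_replace('/^www\./i', '', $domain);
    }

    function log($var = PHP_EOL) {
        $this->output[] = $var;
    }

    function getOutput() {
        return $this->output;
    }
}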
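On the speed concern raised in the question: array_search scans the whole list for every new URL. A keyed array gives a direct lookup instead. The sketch below is an assumption-laden illustration, not part of the script above; the function name splitDuplicates is hypothetical and it expects domains that have already been normalised.

```php
<?php
// Hypothetical helper: separate already-normalised domains into
// first occurrences and repeats, using array keys for the lookup.
function splitDuplicates(array $domains): array
{
    $seen = [];
    $duplicates = [];
    foreach ($domains as $domain) {
        if (isset($seen[$domain])) {   // direct key lookup, no linear scan
            $duplicates[] = $domain;
            continue;
        }
        $seen[$domain] = true;
    }
    return ['unique' => array_keys($seen), 'duplicates' => $duplicates];
}

$result = splitDuplicates(['blog.com', 'somethingelse.blog.com', 'blog.com', 'blog.net']);
print_r($result['duplicates']); // contains the single entry blog.com
```

With a few thousand URLs either approach is fast in practice. When the URLs live in a database, a unique index on the normalised column is a common way to get the same effect there.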