
How to check if a URL already exists in the database in PHP?

I have a scenario where I need to check whether a user-submitted URL is already present in the database. My concern is that a user can submit the same URL in different formats. For example, http://mysite.com/rahul/palake/?&test=1 and http://www.mysite.com/rahul/palake/?&test=1 should be considered one and the same. If I have already stored the URL as http://mysite.com/rahul/palake/?&test=1, then searching the database for http://www.mysite.com/rahul/palake/?&test=1 should give me the message that the URL already exists. I am using the following code for this. It works for me, but I want to make sure it covers all possible scenarios. Can this code be improved?

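    $url = "http://dev.mysite.com/rahul/palake/?&test=1";

    // parse_url() only reports a host when a scheme is present,
    // so assign http as the default scheme if none is given
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $parse_url = parse_url($url);
    $scheme = $parse_url['scheme'];
    $host = $parse_url['host'];

    // everything after the host (port, path, query);
    // assumes the URL carries no user:password@ part
    $rest = substr($url, strlen($scheme) + 3 + strlen($host));

    // build the counterpart URL: add "www." if the host lacks it, strip it otherwise.
    // Only the prefix is checked; strstr($host, 'www') would also match "awww.example.com"
    if (strpos($host, 'www.') === 0) {
        $alt_host = substr($host, 4);
    } else {
        $alt_host = 'www.' . $host;
    }
    $url1 = $scheme . '://' . $alt_host . $rest;

    // now $url1 should be like this: http://www.dev.mysite.com/rahul/palake/?&test=1

    // so that $url and $url1 should be considered as one and the same
    // i.e. mysite.com/rahul/palake/?&test=1 is equivalent to www.mysite.com/rahul/palake/?&test=1
    // and should also be equivalent to http://mysite.com/rahul/palake/?&test=1

    // code to check whether the url already exists in the database goes here

    // here I will be checking if table.url like $url or table.url like $url1
    // if a record is found then return a message that the url already exists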

What about www.example.org/?one=bar&two=foo and www.example.org/?two=foo&one=bar? They are the same URI once normalized, but a plain string comparison would not match them. More examples of the same URI in different notations:

  • www.example.org/?one=bar&two=foo and www.example.org/?one=bar&&&&two=foo
  • www.example.org/#foo and www.example.org/#bar
  • www.example.org/hello/world.html and www.example.org/hello/mars/../world.html
  • www.example.org:80/ and www.example.org/
  • www.EXAMPLE.org and www.example.org/
  • www.example.org/%68%65%6c%6c%6f.html and www.example.org/hello.html

Long story short: you need to normalize the URLs before storing them in the database so that you can compare them later on.
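As a rough illustration, the differences in the examples above could be ironed out with something like the sketch below. The function name normalize_url() and its $stripWww flag are hypothetical, not part of PHP or any library. It also makes a few assumptions of its own: a missing scheme defaults to http, query parameters are sorted (which is only safe if parameter order carries no meaning for the site), empty path segments are collapsed, and the fragment is dropped. Treat it as a starting point rather than a complete implementation of URI normalization.

```php
<?php
// Hypothetical helper: returns a canonical form of $url, or null if it cannot be parsed.
function normalize_url(string $url, bool $stripWww = false): ?string
{
    // Assumption: default to http when no scheme is given, so parse_url() finds the host.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    // Scheme and host are case-insensitive.
    $scheme = strtolower($parts['scheme']);
    $host = strtolower($parts['host']);
    if ($stripWww && strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    // Keep the port only when it is not the default for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = '';
    if (isset($parts['port']) && ($defaultPorts[$scheme] ?? null) !== $parts['port']) {
        $port = ':' . $parts['port'];
    }

    // Decode percent-escapes of unreserved characters (%68 becomes h),
    // and upper-case the hex digits of the escapes that must stay.
    $path = preg_replace_callback('/%([0-9a-f]{2})/i', function ($m) {
        $char = chr(hexdec($m[1]));
        return preg_match('/^[A-Za-z0-9._~-]$/', $char) ? $char : strtoupper($m[0]);
    }, $parts['path'] ?? '/');

    // Resolve "." and ".." segments; empty segments are collapsed (an assumption).
    $raw = explode('/', $path);
    $segments = [];
    foreach ($raw as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $normalizedPath = '/' . implode('/', $segments);
    if ($segments && in_array(end($raw), ['', '.', '..'], true)) {
        $normalizedPath .= '/'; // preserve a trailing slash
    }

    // Drop empty parameters ("&&&&") and sort the rest for a stable order.
    $query = '';
    if (isset($parts['query'])) {
        $pairs = array_filter(explode('&', $parts['query']), 'strlen');
        sort($pairs, SORT_STRING);
        $query = $pairs ? '?' . implode('&', $pairs) : '';
    }

    // The fragment is never sent to the server, so it is left out.
    return $scheme . '://' . $host . $port . $normalizedPath . $query;
}
```

With this in place, the normalized string would be what gets stored and looked up, so the database check becomes an exact match on one column. Passing true for $stripWww treats www.mysite.com and mysite.com as the same host, which is what the question asks for but is not guaranteed to be true for every site.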

I don't know of any PHP library that would do this for you. I've implemented this in JavaScript with URI.js; maybe you can use that to get started…

You also have to consider that, in a load-balanced environment, www could be any one of a number of subdomains, so www.mysite.com could also be reachable as mysite.com, www2.mysite.com, and so on.

I believe a URL by its very nature should be unique, and it's a perfectly plausible scenario that the content served by www.mysite.com is very different from that served by mysite.com.

If your objective with this code is to prevent content duplication then I have two suggestions for a better approach:

Automated: If you think you have a potential matching URL that is not identical, you could use a curl-like command to retrieve the content of both URLs and hash it to determine whether the two are identical (this could give you false negatives for many reasons).
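A minimal sketch of the automated check, assuming the PHP curl extension is available. The helper names fetch_body() and same_content() are made up for this example. Pages that embed timestamps, session tokens or rotating ads will hash differently even when they are effectively the same page, which is one source of the false negatives mentioned above.

```php
<?php
// Hypothetical helper: download the body of $url, or return null on failure.
function fetch_body(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects such as mysite.com -> www.mysite.com
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: two bodies count as the same content when their hashes match.
function same_content(?string $a, ?string $b): bool
{
    if ($a === null || $b === null) {
        return false; // a failed download proves nothing
    }
    return hash('sha256', $a) === hash('sha256', $b);
}
```

A call would then look like same_content(fetch_body($url), fetch_body($url1)), run only for candidate pairs that already look similar.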

Manual: Much like other content submission systems, you could present the user with a list of potential matches and ask them to verify that their content is indeed unique. If you went down this path, I would normalise the database to store each URL with a unique ID that you can then use to link it to the entity you are currently storing. This would allow you to have many entities referring to the one URL, if this is the desired behavior.
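The database layout described here could look roughly like the following. The table names (urls, entities), the column names and the url_id() helper are all invented for illustration, and the example uses an in-memory SQLite database through PDO only so that it runs on its own; the same idea applies to MySQL or any other engine.

```php
<?php
// Assumes the pdo_sqlite extension; swap the DSN for your own database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Each URL is stored once; the UNIQUE constraint rejects exact duplicates.
$db->exec('CREATE TABLE urls (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
)');

// Many entities may point at the same URL through url_id.
$db->exec('CREATE TABLE entities (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    url_id INTEGER NOT NULL REFERENCES urls(id)
)');

// Hypothetical helper: return the ID of a URL, inserting it if it is new.
function url_id(PDO $db, string $url): int
{
    $find = $db->prepare('SELECT id FROM urls WHERE url = ?');
    $find->execute([$url]);
    $id = $find->fetchColumn();
    if ($id !== false) {
        return (int) $id;
    }
    $db->prepare('INSERT INTO urls (url) VALUES (?)')->execute([$url]);
    return (int) $db->lastInsertId();
}

// Two submissions of the same URL end up linked to one urls row.
$link = $db->prepare('INSERT INTO entities (title, url_id) VALUES (?, ?)');
$link->execute(['First post', url_id($db, 'http://mysite.com/rahul/palake/?test=1')]);
$link->execute(['Second post', url_id($db, 'http://mysite.com/rahul/palake/?test=1')]);
```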
