
What's the safest way to determine if 2 URLs are the same?

If I have URL A, say http://www.example.com/, and another, say http://www.example.com, what would be the safest way to determine whether the two are the same, without fetching the web pages and diffing them?

EXAMPLES:

  1. http://www.example.com/ VS http://www.example.com (Mentioned above)
  2. http://www.example.com/aa/../ VS http://www.example.com

EDIT (clarification): I just want to know whether the URLs are the same in the sense of being equivalent according to the RFC 1738 standard.

In .NET, you can use the System.Uri class. For example, in F# Interactive:

let u1 = new Uri("http://www.google.com/");;

val u1 : Uri = http://www.google.com/

let u2 = new Uri("http://www.google.com");;

val u2 : Uri = http://www.google.com/

u1.Equals(u2);;

val it : bool = true

For a more fine-grained comparison, you can use the static Uri.Compare method, which lets you choose which components of the two URIs are compared. There are also static methods for the various forms of escaping and encoding of characters in the URI string, which will prove useful when dealing with the subject thoroughly.

There is very little you can do without requesting the URL. But you can define several heuristics:

  1. Remove trailing slashes
  2. Consider .htm and .html the same
  3. Assume /base/ and /base/index.html are the same
  4. Remove query string parameters (maybe, maybe not, depends on your needs)
  5. Consider url.com and www.url.com the same.

It all depends on what exactly you mean by URLs being the "same".
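The heuristics listed above could be sketched roughly as follows. This is only an illustration in Python, not part of the original answer: the names normalize and probably_same are made up for the example, dropping the fragment is an extra assumption, and every rule is a guess about how the server behaves rather than a guarantee.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url, drop_query=False):
    """Reduce a URL to a heuristic comparison key (hypothetical helper)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Heuristic 5: treat url.com and www.url.com as the same host.
    if host.startswith("www."):
        host = host[4:]
    # Keep an explicit port so that different ports stay different.
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    path = parts.path
    # Heuristic 3: assume /base/index.html is the same as /base/.
    for index_name in ("index.html", "index.htm"):
        if path.endswith("/" + index_name):
            path = path[: -len(index_name)]
            break
    # Heuristic 2: treat .htm and .html as the same extension.
    if path.endswith(".htm"):
        path += "l"
    # Heuristic 1: remove trailing slashes.
    path = path.rstrip("/")

    # Heuristic 4 (optional): ignore the query string entirely.
    query = "" if drop_query else parts.query
    # Assumption: the fragment is dropped, since it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), netloc, path, query, ""))


def probably_same(a, b, drop_query=False):
    """True if both URLs reduce to the same key under the heuristics."""
    return normalize(a, drop_query) == normalize(b, drop_query)
```

For instance, probably_same("http://url.com/base/", "http://www.url.com/base/index.html") is true under these rules, even though a real server is free to serve different content for the two. Which rules to switch on depends on how much risk of a false match you can accept.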

For the benefit of those of you who don't know F#, here's a quick and dirty but complete C# console app that demonstrates the use of the Uri class to tell if two URLs are the same. When you run this code, you should see two lines: "True", followed by "False":

using System;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Same host, differing only by the trailing slash: prints "True".
            Console.WriteLine(IsSameUrl("http://stackoverflow.com/", "http://stackoverflow.com"));
            // Different hosts: prints "False".
            Console.WriteLine(IsSameUrl("http://stackoverflow.com/", "http://codinghorror.com"));
            Console.ReadKey();
        }

        // Uri parses and canonicalizes each URL; Equals then compares the results.
        static bool IsSameUrl(string url1, string url2)
        {
            Uri u1 = new Uri(url1);
            Uri u2 = new Uri(url2);
            return u1.Equals(u2);
        }
    }
}

There are a few things to add to Yuval A's answer:

  • www.google.com and http://www.google.com may point to the same target
  • www.google.com and google.com point to the same page (but this is implemented by a redirect)
  • A URL may be percent-encoded (see the HttpUtility.UrlEncode / HttpUtility.UrlDecode methods)
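Two of these points, the missing scheme and the encoding, can be handled without any network request. The Python sketch below is an illustration under stated assumptions: canonical and decode_unreserved are hypothetical names, "http" is assumed as the default scheme, and only the characters that RFC 3986 calls unreserved are decoded, so that an escaped %2F is not confused with a real path separator. The www.google.com versus google.com case is deliberately left alone, because a redirect can only be discovered by asking the server.

```python
import re
import string
from urllib.parse import urlsplit

# Characters that may be percent-decoded without changing a URL's meaning.
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(text):
    """Decode %XX escapes of unreserved characters; uppercase the rest."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def canonical(url, default_scheme="http"):
    """Return a comparison key for a URL (hypothetical helper)."""
    # Assumption: a bare "www.google.com" is meant as an http URL.
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # An empty path and "/" refer to the same resource.
    path = decode_unreserved(parts.path) or "/"
    return (parts.scheme.lower(), host, parts.port, path, parts.query)
```

With this, canonical("www.google.com") equals canonical("http://www.google.com"), and /%7Euser compares equal to /~user, while /a%2Fb stays distinct from /a/b.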


 