Top level domain from URL in C#

I am using C# and ASP.NET for this.

We receive a lot of "strange" requests on our IIS 6.0 servers and I want to log and catalog these by domain.

For example, we get strange requests like these:

http://www.poker.winner4ever.example.com/

http://www.hotgirls.example.com/

http://santaclaus.example.com/

http://m.example.com/

http://wap.example.com/

http://iphone.example.com/

The last three are kind of obvious, but I would like to sort them all into one group, since "example.com" IS hosted on our servers. The rest aren't, sorry :-)

So I am looking for some good ideas for how to retrieve example.com from the URLs above. Secondly, I would like to match m., wap., iphone. etc. into a group, but that's probably just a quick lookup in a list of mobile prefixes. I could hand-code this list for a start.

But is a regular expression the answer here, or is pure string manipulation the easiest way? I was thinking of splitting the URL string on "." and then looking at item[0] and item[1]...

Any ideas?

You can use the Nager.PublicSuffix NuGet package. It uses the same data source that browser vendors use (the Public Suffix List).

NuGet

PM> Install-Package Nager.PublicSuffix

Example

var domainParser = new DomainParser(new WebTldRuleProvider());

var domainInfo = domainParser.Parse("sub.test.co.uk");
//domainInfo.Domain = "test";
//domainInfo.Hostname = "sub.test.co.uk";
//domainInfo.RegistrableDomain = "test.co.uk";
//domainInfo.SubDomain = "sub";
//domainInfo.TLD = "co.uk";

The following code uses the Uri class to obtain the host name, and then obtains the second level host (examplecompany.com) from Uri.Host by splitting the host name on periods.

var uri = new Uri("http://www.poker.winner4ever.examplecompany.com/");
var splitHostName = uri.Host.Split('.');
if (splitHostName.Length >= 2)
{
    var secondLevelHostName = splitHostName[splitHostName.Length - 2] + "." +
                              splitHostName[splitHostName.Length - 1];
}

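This can return something other than what you want in some cases, but country-code TLDs are the only two-character TLDs, and many of them use a short (2- or 3-character) second level such as co.uk or com.au. The function below treats that pattern as a three-label root and everything else as a two-label root, so it gives the right answer in most cases. It misfires when a short name is registered directly under a country code: www.nic.fr, for example, comes back unchanged instead of as nic.fr.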

string GetRootDomain(string host)
{
    string[] domains = host.Split('.');

    if (domains.Length >= 3)
    {
        int c = domains.Length;
        // handle international country code TLDs 
        // www.amazon.co.uk => amazon.co.uk
        if (domains[c - 1].Length < 3 && domains[c - 2].Length <= 3)
            return string.Join(".", domains, c - 3, 3);
        else
            return string.Join(".", domains, c - 2, 2);
    }
    else
        return host;
}
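
One more edge case is worth guarding against when feeding raw request URLs into a splitter like this: an IP address such as 192.168.0.1 also splits on periods, and the function above would return 168.0.1 for it. The following is a minimal sketch of a guard, not part of the original answer. RootDomainGuard, TryGetRootDomain and the extractor parameter are hypothetical names; the extractor stands in for whichever root-domain function you choose.

```csharp
using System;

public static class RootDomainGuard
{
    // Hypothetical wrapper: reports false when the input is not an absolute URL
    // with a DNS host name; otherwise passes the host to the supplied extractor.
    public static bool TryGetRootDomain(string url, Func<string, string> extractor, out string rootDomain)
    {
        rootDomain = string.Empty;

        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;

        // IPv4/IPv6 literals (and empty hosts) have no registrable domain.
        if (uri.HostNameType != UriHostNameType.Dns)
            return false;

        rootDomain = extractor(uri.Host);
        return true;
    }
}
```

Called as RootDomainGuard.TryGetRootDomain(url, GetRootDomain, out root), it lets the logging code skip malformed URLs and IP-based requests instead of cataloging them under a bogus domain.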

This is not possible without an up-to-date database of the different domain levels.

Consider:

s1.moh.gov.cn
moh.gov.cn
s1.google.com
google.com

At which level do you want to cut the domain? It depends entirely on the TLD, SLD, ccTLD and so on. Because ccTLDs are under the control of individual countries, a country may define very specific SLDs that you have no way of knowing in advance.
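
To make the idea of a suffix database concrete, here is a minimal sketch of a lookup against a known-suffix set. It is only an illustration: SuffixLookup, GetRegistrableDomain and the five sample suffixes are assumptions made for this example, and a real list (for instance the data published at publicsuffix.org) is far larger and also has wildcard and exception rules that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixLookup
{
    // Illustrative sample only; a real application would load a maintained list.
    private static readonly HashSet<string> KnownSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "cn", "gov.cn", "uk", "co.uk"
        };

    // Returns the label just left of the longest known suffix, plus that suffix.
    // Hosts with no known suffix are returned unchanged; a host that is itself
    // a suffix is not treated specially.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.TrimEnd('.').Split('.');

        // Start with the longest candidate suffix so "gov.cn" wins over "cn".
        for (int i = 1; i < labels.Length; i++)
        {
            string suffix = string.Join(".", labels, i, labels.Length - i);
            if (KnownSuffixes.Contains(suffix))
                return string.Join(".", labels, i - 1, labels.Length - i + 1);
        }

        return host;
    }
}
```

With this sample set, both s1.moh.gov.cn and moh.gov.cn map to moh.gov.cn, and both s1.google.com and google.com map to google.com. The result is only as good as the suffix list behind it.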

I needed the same thing, so I wrote a class that you can copy and paste into your solution. It uses a hard-coded string array of TLDs. http://pastebin.com/raw.php?i=VY3DCNhp

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.com/path/page.htm"));

outputs microsoft.com

and

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.co.uk/path/page.htm"));

outputs microsoft.co.uk

I've written a library for use in .NET 2+ to help pick out the domain components of a URL.

More details are on GitHub, but one benefit over previous options is that it can download the latest data from http://publicsuffix.org automatically (once per month), so the output from the library should be more or less on a par with what web browsers use to establish domain security boundaries (i.e. pretty good).

It's not perfect yet, but it suits my needs and shouldn't take much work to adapt to other use cases, so please fork it and send a pull request if you want.

Use a regular expression:

^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)(?:[/:?#].*)?$

This matches any URL whose host ends with one of the TLDs you are interested in. Extend the alternation with as many as you want. The three capturing groups contain the subdomain (if any), the registered name, and the TLD respectively.
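
As a hedged sketch of how such a pattern might be applied from C#, the class below wraps it with System.Text.RegularExpressions. UrlRegexDemo and TrySplit are hypothetical names, and the pattern is written with the alternation inside a single group and the dots escaped so that each capture lines up with one part of the host. Treat it as an illustration for the three listed TLDs only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRegexDemo
{
    // Group 1: subdomain (optional), group 2: registered name, group 3: TLD.
    private static readonly Regex Pattern = new Regex(
        @"^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)(?:[/:?#].*)?$",
        RegexOptions.IgnoreCase);

    // Returns false (and empty strings) when the URL does not end in a listed TLD.
    public static bool TrySplit(string url, out string subdomain, out string name, out string tld)
    {
        Match m = Pattern.Match(url);
        subdomain = m.Success ? m.Groups[1].Value : string.Empty;
        name = m.Success ? m.Groups[2].Value : string.Empty;
        tld = m.Success ? m.Groups[3].Value : string.Empty;
        return m.Success;
    }
}
```

For http://www.poker.winner4ever.example.com/ this yields the subdomain www.poker.winner4ever, the name example and the TLD com. A URL under any TLD missing from the alternation simply does not match, which is the main limitation of the regex approach.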

The following expression strips "www." from the host and returns everything from the first remaining period onward:

uri.Host.ToLower().Replace("www.","").Substring(uri.Host.ToLower().Replace("www.","").IndexOf('.'))

  • returns ".com" for Uri uri = new Uri("http://stackoverflow.com/questions/4643227/top-level-domain-from-url-in-c");

  • returns ".co.jp" for Uri uri = new Uri("http://stackoverflow.co.jp");

  • returns ".s1.moh.gov.cn" for Uri uri = new Uri("http://stackoverflow.s1.moh.gov.cn");

etc.
