
Extract main domain name from a given URL

I used the following to extract the domain from a URL (these are test cases):

String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");

for (String t : cases) {
    // Strip a leading label that starts with "ww", then print the result
    String res = t.replaceAll(regex, "");
    System.out.println(res);
}

I get the following results:

google.com
hopperspot.com
socialrating.it
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com

The first four cases are good. The last one is not: I want blogspot.com, but it gives zoyanailpolish.blogspot.com. What am I doing wrong?

Using the Guava library, we can easily get the domain name:

InternetDomainName.from(tld).topPrivateDomain()

Refer to the API documentation for more details:

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html

Obtaining the host through a regex is complicated, if not impossible, because TLDs do not follow simple rules: they are assigned by ICANN and change over time.

You should instead use the functionality provided by the Java library, like this:

URL myUrl = new URL(urlString);
myUrl.getHost();
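As a minimal sketch of the host-extraction step, the class below wraps it in a helper that also accepts inputs without a scheme, such as the test cases in the question. The class name HostExtractor, the method hostOf, and the choice to prepend "http://" are assumptions made for illustration; it uses java.net.URI rather than URL, which behaves the same way for these inputs.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HostExtractor {

    /** Returns the host part of the input, or null if it cannot be parsed. */
    static String hostOf(String input) {
        // URI only reports a host when a scheme is present,
        // so prepend one if it is missing (assumption: plain HTTP).
        String candidate = input.contains("://") ? input : "http://" + input;
        try {
            return new URI(candidate).getHost();
        } catch (URISyntaxException e) {
            return null; // not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        // prints www.zoyanailpolish.blogspot.com
        System.out.println(hostOf("http://www.zoyanailpolish.blogspot.com/some/long/link"));
        // prints xtop10.net
        System.out.println(hostOf("xtop10.net"));
    }
}
```

Note that this only yields the full host name; deciding which part of it is the main domain still requires knowledge of the suffixes.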

It's 2013, and the solution I found is straightforward:

System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());

It is much simpler:

  try {
        String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

        String[] levels = domainName.split("\\.");
        if (levels.length > 1)
        {
            domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
        }

        // now value of domainName variable is blogspot.com
    } catch (Exception e) {}

As suggested by BalusC and others, the most practical solution would be to get a list of TLDs (see this list), save them to a file, load them, and then determine which TLD is used by a given URL string. From there you could construct the main domain name as follows:

    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD( url ); // To be implemented. Add to helper class ?

    url = url.replace( "." + tld,"");  

    int pos = url.lastIndexOf('.');

    String mainDomain = "";

    if (pos > 0 && pos < url.length() - 1) {
        mainDomain = url.substring(pos + 1) + "." + tld;
    }
    // else: Main domain name comes out empty

The implementation details are left up to you.
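One possible way to fill in findTLD is a longest-suffix lookup against a set of known suffixes. The sketch below is only an illustration: the class name MainDomain and the five hardcoded suffixes are hypothetical stand-ins for a list loaded from a file, and unlike the snippet above it also returns a result when the host has no subdomain.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class MainDomain {

    // Hypothetical, tiny sample; a real list would be loaded from a file.
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "net", "it", "uk", "co.uk"));

    /** Returns the longest known suffix the host ends with, or null if none matches. */
    static String findTLD(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        // Walk the dots left to right, so longer candidates are tried first.
        int dot = h.indexOf('.');
        while (dot >= 0) {
            String candidate = h.substring(dot + 1);
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            dot = h.indexOf('.', dot + 1);
        }
        return null;
    }

    /** Returns "label.suffix" for the host, or an empty string if no suffix is known. */
    static String mainDomain(String host) {
        String tld = findTLD(host);
        if (tld == null) {
            return "";
        }
        String h = host.toLowerCase(Locale.ROOT);
        // Drop ".suffix", then keep only the last remaining label.
        String rest = h.substring(0, h.length() - tld.length() - 1);
        String label = rest.substring(rest.lastIndexOf('.') + 1);
        return label.isEmpty() ? "" : label + "." + tld;
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(mainDomain("www.example.co.uk"));           // example.co.uk
    }
}
```

Trying the longest candidate first matters for multi-label suffixes: with "co.uk" in the set, www.example.co.uk resolves to example.co.uk rather than co.uk.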

The reason you are seeing zoyanailpolish.blogspot.com is that your regex only matches strings that start with 'ww'. What you are asking is that, in addition to stripping the leading label from strings that start with 'ww', it should also work for a string starting with 'zoyanailpolish' (?). In that case, use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; This will remove any leading label that starts with 'ww', 'z', or 'a'. Customize it based on exactly what you need.
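To see the anchoring behaviour concretely, here is a small check of the question's original regex against three of its test cases. The class name RegexCheck and the helper strip are made up for this illustration.

```java
final class RegexCheck {

    // The regex from the question: a leading label that starts with "ww".
    static final String REGEX = "^(ww[a-zA-Z0-9-]{0,}\\.)";

    /** Removes the leading label if, and only if, it starts with "ww". */
    static String strip(String host) {
        return host.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("www.google.com"));                      // google.com
        System.out.println(strip("wwwsupernatural-brasil.blogspot.com")); // blogspot.com
        System.out.println(strip("zoyanailpolish.blogspot.com"));         // unchanged
    }
}
```

The last host is returned unchanged because nothing at the start of the string matches "ww", so the pattern never applies.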

InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com

This works better in my case:

InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com

Details: https://github.com/google/guava/wiki/InternetDomainNameExplained
