Matching subdomain and top domain using regex in Java

This is a follow-up to the question "Regex to match pattern with subdomain in java".

I use the pattern below to match the domain and subdomain:

  Pattern pattern = Pattern.compile("http://([a-z0-9]*.)example.com");

This pattern matches the following:

  • http://asd.example.com
  • http://example.example.com
  • http://www.example.com

but it does not match:

  • http://example.com

Can anyone tell me how to match http://example.com too?

Just make the first part optional by adding a ? after the group:

Pattern pattern = Pattern.compile("http://([a-z0-9]*\\.)?example\\.com");

Note that . matches any character; you should use \\. to match a literal dot.
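To see what the optional group and the escaped dots change, here is a minimal sketch that runs both patterns against URLs from the question. The class name SubdomainMatchDemo, the helper matches and the input http://wwwXexampleYcom are made up for illustration. The sketch uses Matcher.matches(), which requires the whole input to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative and not from the original answer.
class SubdomainMatchDemo {
    // Pattern from the question: the subdomain is mandatory and the dots are unescaped.
    static final Pattern ORIGINAL = Pattern.compile("http://([a-z0-9]*.)example.com");
    // Pattern from the answer: optional subdomain group and literal dots.
    static final Pattern FIXED = Pattern.compile("http://([a-z0-9]*\\.)?example\\.com");

    // True only when the entire URL matches the pattern.
    static boolean matches(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://asd.example.com",
            "http://www.example.com",
            "http://example.com",
            "http://wwwXexampleYcom" // made-up input that only an unescaped dot accepts
        };
        for (String url : urls) {
            System.out.println(url + " original=" + matches(ORIGINAL, url)
                    + " fixed=" + matches(FIXED, url));
        }
    }
}
```

With the fixed pattern, http://example.com is accepted and the made-up input is rejected. The original pattern does the opposite for both.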

You can use this regex pattern to get the domains of all URLs:

\\p{L}{0,10}(?:://)?[\\p{L}\\.]{1,50}

For example:

Input  = http://www.google.com/search?q=a
Output = http://www.google.com

Input  = ftp://www.google.com/search?q=a
Output = ftp://www.google.com

Input  = www.google.com/search?q=a
Output = www.google.com

Here, \\p{L}{0,10} stands for the http, https and ftp parts (there could be more that I don't know of), (?:://)? stands for the :// part if it appears, and [\\p{L}\\.]{1,50} stands for the foo.bar.foo.com part. The rest of the URL is cut off.
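The breakdown above can be checked by wrapping the first and last parts in capturing groups. This is only a sketch: the capturing groups, the class name UrlPartsDemo and the helper split are additions for illustration, not part of the answer's pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the capturing groups are added for illustration only.
class UrlPartsDemo {
    // Same pattern as above, with the first and last parts captured.
    static final Pattern PARTS =
            Pattern.compile("(\\p{L}{0,10})(?:://)?([\\p{L}\\.]{1,50})");

    // Returns {leading letters, following letters and dots}, or null if nothing matches.
    static String[] split(String url) {
        Matcher m = PARTS.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.google.com/search?q=a",
            "www.google.com/search?q=a"
        };
        for (String url : urls) {
            String[] parts = split(url);
            System.out.println(url + " -> [" + parts[0] + "] [" + parts[1] + "]");
        }
    }
}
```

When the input has no scheme, the first part greedily takes the leading letters (www here) and the last part continues from the first dot, so the overall match is still www.google.com.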

And here is the Java code that accomplishes the job:

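import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the original answer shows only the members.
public class DomainExtractor {
    public static final String DOMAIN_PATTERN = "\\p{L}{0,10}(?:://)?[\\p{L}\\.]{1,50}";

    // Returns the scheme-and-host prefix of the URL, or "" if nothing matches.
    public static String getDomain(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        Pattern p = Pattern.compile(DOMAIN_PATTERN);
        Matcher m = p.matcher(url);

        // find() locates the first match; group() returns the matched text.
        if (m.find()) {
            return m.group();
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(getDomain("www.google.com/search?q=a"));
    }
}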

Output = www.google.com

Finally, if you want to match just "example.com", you can simply add it to the end of the pattern, like this:

\\p{L}{0,10}(?:://)?[\\p{L}\\.]{0,50}example\\.com

This will get all of the domains containing "example.com":

Input  = http://www.foo.bar.example.com/search?q=a
Output = http://www.foo.bar.example.com
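
As a rough sketch of how that pattern behaves in code, the helper below returns the matched prefix for example.com URLs and an empty string for anything else. The class name ExampleDomainDemo and the method getExampleDomain are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the example.com variant of the pattern.
class ExampleDomainDemo {
    static final Pattern EXAMPLE_DOMAIN =
            Pattern.compile("\\p{L}{0,10}(?:://)?[\\p{L}\\.]{0,50}example\\.com");

    // Returns the scheme and host when the URL contains example.com, else "".
    static String getExampleDomain(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        Matcher m = EXAMPLE_DOMAIN.matcher(url);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(getExampleDomain("http://www.foo.bar.example.com/search?q=a"));
        System.out.println(getExampleDomain("http://example.com/index.html"));
        System.out.println(getExampleDomain("http://www.google.com/search?q=a").isEmpty());
    }
}
```

One thing to be aware of: the pattern does not require a dot before example, so a host such as notexample.com is matched as well. If that matters, the subdomain part would need to be written so that it ends in a literal dot.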

Note: \\p{Ll} can be used instead of \\p{L}, because \\p{Ll} matches only lowercase Unicode letters (\\p{L} matches all kinds of Unicode letters) and domain names are usually written in lowercase.
