Java regEx URL matching problem

Question

As usual, thank you in advance.

I am trying to familiarize myself with regex and am having trouble matching a URL.

Here is an example URL:

www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html

Here is how I break the URL down:

[site]/[dir]*?/[year]/[month]/[day]/[storyTitle]?/[id]/htmlpage.html

The [id] is a string 22 characters long that can contain uppercase letters, lowercase letters, and digits. I do not want to extract it from the URL; I mention it only for clarity.

Now, I need to extract two values from this URL.

First, I need to extract the dir(s). The [dir] is optional, but there can also be as many as wanted. In other words, that part may be absent, or it may be dir1/dir2/dir3, etc. So, going off my first example:
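As a minimal sketch of the [id] rule described above, assuming the id uses only ASCII letters and digits: the character class `[A-Za-z0-9]{22}` states the rule exactly, whereas `\w{22}` would also accept underscores. The class name `IdCheck` and the method `isId` are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class IdCheck {
    // Exactly 22 ASCII letters or digits. \w is avoided because it also matches '_'.
    static final Pattern ID = Pattern.compile("[A-Za-z0-9]{22}");

    // matches() requires the whole segment to fit the pattern, not just a part of it.
    static boolean isId(String segment) {
        return ID.matcher(segment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isId("FAQKZjC3veXSalP9zxFgZP")); // true
        System.out.println(isId("title-of-some-story"));    // false
    }
}
```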

    www.examplesite.com/dir1/dir2/dir3/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html

Here I would need to extract dir1/dir2/dir3, where each dir is a single word of all lowercase letters (e.g. sports/mlb/games). Real dirs contain no digits; the digits in dir1/dir2/dir3 are only for illustration.

But in this example of a valid URL:

www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html

there is no [dir], so I would not extract anything. Thus, the [dir] is optional.

Second, I need to extract the [storyTitle]. Like the [dir] above, the [storyTitle] is optional, but if there is a storyTitle there can be only one.

So, going off my previous examples,
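One way to express "optional, but any number of them" is a repeated non-capturing group, `(?:[a-z]+/)*`, wrapped in a single capturing group. The sketch below assumes dirs are lowercase-only words (so the placeholder names dir1/dir2/dir3 would not match), that the date is always four digits followed by two and two, and that an optional http:// or https:// prefix may be present. `DirPrefix` and `extractDirs` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DirPrefix {
    // Host, then zero or more lowercase-only segments, then the yyyy/mm/dd date.
    // Group 1 captures all segments with their trailing slashes, or "" when there are none.
    static final Pattern DIRS =
            Pattern.compile("^(?:https?://)?[^/]+/((?:[a-z]+/)*)\\d{4}/\\d{2}/\\d{2}/");

    // Returns e.g. "sports/mlb/games", "" when no dir is present,
    // or null when the URL does not have the expected shape.
    static String extractDirs(String url) {
        Matcher m = DIRS.matcher(url);
        if (!m.find()) {
            return null;
        }
        String dirs = m.group(1);
        // Drop the trailing '/' carried by the last repeated segment.
        return dirs.isEmpty() ? dirs : dirs.substring(0, dirs.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(extractDirs(
                "www.examplesite.com/sports/mlb/games/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html"));
        System.out.println(extractDirs(
                "www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html"));
    }
}
```

The first call prints sports/mlb/games and the second prints an empty line, because the repeated group matched zero times.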

www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html

would be valid, and I would need to extract 'title-of-some-story'. Story titles are dash-separated strings that are always lowercase. The example below is also valid:

www.examplesite.com/dir/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html

In the above example there is no [storyTitle], which is what makes it optional.

Lastly, just to be thorough, a URL without a [dir] and without a [storyTitle] is also valid. Example:
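The "at most one" rule for the title can be sketched with an optional group, `(?:(...)/)?`, placed directly before the id. This assumes titles are lowercase words joined by single dashes and that the URL always ends in the id followed by /htmlpage.html. `TitleSegment` and `extractTitle` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleSegment {
    // Date, then an optional dash-separated lowercase title, then the 22-character id.
    // Group 1 is null when the optional title segment is absent.
    static final Pattern TAIL = Pattern.compile(
            "/\\d{4}/\\d{2}/\\d{2}/(?:([a-z]+(?:-[a-z]+)*)/)?[A-Za-z0-9]{22}/htmlpage\\.html$");

    // Returns the title, "" when the URL has no title,
    // or null when the URL does not have the expected shape.
    static String extractTitle(String url) {
        Matcher m = TAIL.matcher(url);
        if (!m.find()) {
            return null;
        }
        String title = m.group(1);
        return title == null ? "" : title;
    }

    public static void main(String[] args) {
        System.out.println(extractTitle(
                "www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html"));
        System.out.println(extractTitle(
                "www.examplesite.com/dir/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html"));
    }
}
```

Because the id contains uppercase letters and digits in these examples, the title group cannot swallow it; and since the id is required while the title is optional, the regex engine falls back to "no title" whenever only one segment follows the date.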

www.examplesite.com/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html

is a valid URL. Any input would be helpful. I hope I am clear.

Answer 1

Here is one example that will work.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlMatchExample {
    public static void main(String[] args) {

        // Group 1: optional dir(s), matched lazily up to the /year/month/day run.
        // Group 2: optional story title, sitting just before the 22-character id.
        Pattern p = Pattern.compile("(?:http://)?.+?(/.+?)?/\\d+/\\d{2}/\\d{2}(/.+?)?/\\w{22}");

        String[] strings ={
                "www.examplesite.com/dir1/dir2/4444/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
                "www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
                "www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
                "www.examplesite.com/dir/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html",
                "www.examplesite.com/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html"
        };
        for (int idx = 0; idx < strings.length; idx++) {
            Matcher m = p.matcher(strings[idx]);
            if (m.find()) {
                // Each group is null when absent and starts with '/' when present.
                String dir = m.group(1);
                String title = m.group(2);
                if (dir != null) {
                    dir = dir.substring(1); // remove the leading /
                }
                if (title != null) {
                    title = title.substring(1); // remove the leading /
                }
                System.out.println(idx+": Dir: "+dir+", Title: "+title);
            }
        }
    }
}

Answer 2

Here is an all-regex solution.

Edit: Allows for http://

Java source:

import java.util.regex.*;

class Main
{
    public static void main (String[] args)
    {
        String url = "http://www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html";
        String url2 = "www.examplesite.com/dir/dir2/dir3/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html";
        String url3 = "www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html";

        // Group 1 captures the dir(s) before /yyyy/mm/dd; group 2 captures the title after it.
        // Both groups use \S*, so a missing part yields an empty string rather than null.
        // The trailing /[\S]*/[\S]* consumes the id and file name without checking their format.
        String patternStr = "(?:http://)?[^/]*[/]?([\\S]*)/[\\d]{4}/[\\d]{2}/[\\d]{2}[/]?([\\S]*)/[\\S]*/[\\S]*";

        // Compile regular expression
        Pattern pattern = Pattern.compile(patternStr);


        // Match 1st url
        System.out.println("Match 1st URL:");
        Matcher matcher = pattern.matcher(url);

        if (matcher.find()) {
            System.out.println("URL: " + matcher.group(0));
            System.out.println("DIR: " + matcher.group(1));
            System.out.println("TITLE: " + matcher.group(2));
        }
        else{ System.out.println("No match."); }


        // Match 2nd url
        System.out.println("\nMatch 2nd URL:");
        matcher = pattern.matcher(url2);

        if (matcher.find()) {
            System.out.println("URL: " + matcher.group(0));
            System.out.println("DIR: " + matcher.group(1));
            System.out.println("TITLE: " + matcher.group(2));
        }
        else{ System.out.println("No match."); }


        // Match 3rd url
        System.out.println("\nMatch 3rd URL:");
        matcher = pattern.matcher(url3);

        if (matcher.find()) {
            System.out.println("URL: " + matcher.group(0));
            System.out.println("DIR: " + matcher.group(1));
            System.out.println("TITLE: " + matcher.group(2));
        }
        else{ System.out.println("No match."); }
    }
}

Output:

Match 1st URL:
URL: http://www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
DIR: dir
TITLE: title-of-some-story

Match 2nd URL:
URL: www.examplesite.com/dir/dir2/dir3/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
DIR: dir/dir2/dir3
TITLE: 

Match 3rd URL:
URL: www.examplesite.com/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html
DIR: 
TITLE: title-of-some-story
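For comparison, here is a stricter sketch that is not part of either answer. It assumes the rules stated in the question hold literally: lowercase-only dirs, a dash-separated lowercase title, a 22-character alphanumeric id, and a fixed htmlpage.html file name. It uses named groups (available since Java 7) and `matches()` so that the whole URL must fit. `StrictUrlParser`, `STORY_URL`, and `parse` are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictUrlParser {
    static final Pattern STORY_URL = Pattern.compile(
            "(?:https?://)?[^/]+/"                     // optional scheme, then host
            + "(?<dirs>(?:[a-z]+/)*)"                  // zero or more lowercase dirs
            + "\\d{4}/\\d{2}/\\d{2}/"                  // year/month/day
            + "(?:(?<title>[a-z]+(?:-[a-z]+)*)/)?"     // optional dash-separated title
            + "[A-Za-z0-9]{22}/htmlpage\\.html");      // 22-character id and file name

    // Returns {dirs, title}, using "" for a missing part,
    // or null when the URL does not match the whole pattern.
    static String[] parse(String url) {
        Matcher m = STORY_URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        String dirs = m.group("dirs");
        if (!dirs.isEmpty()) {
            dirs = dirs.substring(0, dirs.length() - 1); // drop the trailing '/'
        }
        String title = m.group("title"); // null when the optional group did not take part
        return new String[] { dirs, title == null ? "" : title };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse(
                "http://www.examplesite.com/dir/2012/06/19/title-of-some-story/FAQKZjC3veXSalP9zxFgZP/htmlpage.html")));
        System.out.println(Arrays.toString(parse(
                "www.examplesite.com/2012/06/19/FAQKZjC3veXSalP9zxFgZP/htmlpage.html")));
    }
}
```

The trade-off is that this variant rejects URLs the looser patterns accept, such as dirs containing digits, so it only fits if the rules in the question are exact.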

2 answers

Answer 1
1 Accepted 2012-06-20 16:45:36

Answer 2
0 2012-06-20 17:25:47
