How can I adjust this regex to filter out "

Question

I have the following regex working to search for video links in a page:

(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)

Unfortunately, it does not stop at the end of the link if another match follows right behind it. For example, this video link

<a href="http://somevideo.flv">somevideoname.avi</a>

would return this after applying the regex:

http://somevideo.flv">somevideoname.avi

How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!

Answer 1

Here is how you can do something similar with the JSoup parser, which reads the href attributes directly instead of matching raw HTML with a regex.

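import java.io.File;
import java.net.URL;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// read the whole file into one string ("\\Z" matches the end of input)
Scanner scanner = new Scanner(new File("input.txt"));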
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();

Document doc = Jsoup.parse(htmlString);
// or, to fetch the content of a page directly, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]"); // find all anchors with an href attribute
for (Element el : elements) {
    URL url = new URL(el.attr("href"));
    if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
        System.out.println("url: " + url);
        //System.out.println("file: " + url.getPath());
        System.out.println("file name: "
                + new File(url.getPath()).getName());
        System.out.println("------");
    }
}

Answer 2

I'm not sure I understand the groupings in your regexp. At any rate, this one should work:

\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
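To see how this pattern behaves on the link from the question, here is a minimal sketch. The class name VideoLinkFinder and the helper findVideoLinks are hypothetical names chosen for illustration; only the pattern itself comes from the answer. The key parts are [^\"]+?, which cannot cross a double quote, and the lazy +?, which stops at the first extension it finds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VideoLinkFinder {
    // Pattern from the answer above, written as a Java string literal.
    private static final Pattern VIDEO_URL =
            Pattern.compile("\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b");

    // Hypothetical helper: collects every video URL found in the given HTML text.
    static List<String> findVideoLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(html);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // The link from the question.
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(findVideoLinks(html)); // prints [http://somevideo.flv]
    }
}
```

Because the character class excludes the double quote, the match can never run past the end of a double-quoted attribute value. URLs inside single-quoted attributes are not bounded the same way, so this sketch assumes double-quoted href values.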

Answer 3

If you only want to extract href attribute values then you're better off matching against the following pattern:

href=("|')(.*?)\.(avi|flv|mp4)\1

This should match "href=" followed by either a double-quote or single-quote character, then lazily capture everything up to the file extension, which must be followed by the same quote character that opened the attribute value (the \1 backreference). The href attribute value can then be rebuilt with

matcher.group(2) + "." + matcher.group(3)

which concatenates the file path and name, a period, and the file extension.
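As a rough illustration of the group arithmetic, the sketch below applies this pattern with java.util.regex and rebuilds each URL from groups 2 and 3. The class name HrefVideoExtractor, the helper extract, and the example.com URL are assumptions made for the example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefVideoExtractor {
    // group(1) = opening quote, group(2) = path and file name, group(3) = extension.
    // The backreference \1 requires the closing quote to equal the opening one.
    private static final Pattern HREF_VIDEO =
            Pattern.compile("href=(\"|')(.*?)\\.(avi|flv|mp4)\\1");

    // Hypothetical helper: returns the href values that end in a video extension.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HREF_VIDEO.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(2) + "." + matcher.group(3));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>"
                + "<a href='https://example.com/clip.mp4'>clip</a>";
        // prints [http://somevideo.flv, https://example.com/clip.mp4]
        System.out.println(extract(html));
    }
}
```

One caveat with this approach: the lazy .*? is allowed to cross quote characters, so an href that does not point to a video can be merged with a later one that does. Replacing .*? with a class that excludes quotes would be stricter, at the cost of changing the answer's pattern.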

Answer 4

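Your regex is greedy: both [^/]+ and \\S+ grab as much text as they can, so the match runs on to the last extension it can reach. Limit its greediness by making the quantifiers lazy. Note that \\S+ has to become lazy as well, otherwise it still swallows the rest of the link text:

(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)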
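The difference between greedy and lazy quantifiers can be checked with a small comparison. This is only a sketch: the class name GreedyVsLazyDemo and the helper firstMatch are hypothetical, the greedy pattern is the one from the question, and the lazy variant assumes that every quantifier, including \\S+, is made lazy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Regex from the question: greedy quantifiers throughout.
    static final String GREEDY = "(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)";
    // Assumed variant for this sketch: both + quantifiers made lazy with +?.
    static final String LAZY = "(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)";

    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        // Greedy: runs on to the last extension, reproducing the problem.
        System.out.println(firstMatch(GREEDY, html)); // http://somevideo.flv">somevideoname.avi
        // Lazy: stops at the first extension.
        System.out.println(firstMatch(LAZY, html));   // http://somevideo.flv
    }
}
```

Lazy matching stops at the first extension, but it does not know where the attribute ends. A link whose own extension is not in the list could still be joined with later text, which is why excluding the quote character or using an HTML parser is the more robust choice.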

4 answers

solution1
2 ACCEPTED 2013-11-02 13:55:44

solution2
1 2013-11-02 12:58:12

solution3
1 2013-11-02 13:32:20

solution4
1 2013-11-02 13:34:34
