
How can I adjust this regex to filter out "

I have the following regex working to search for video links in a page:

(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)

Unfortunately, it does not stop at the end of the link if another match follows right behind it. For example, this video link

<a href="http://somevideo.flv">somevideoname.avi</a>

would return this after applying the regex:

http://somevideo.flv">somevideoname.avi

How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!

Here is how you can do something similar with the jsoup parser.

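import java.io.File;
import java.net.URL;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Note: the statements below throw checked exceptions (FileNotFoundException,
// MalformedURLException), so the enclosing method must declare or handle them.

// read the whole file into one string ("\\Z" matches the end of input)
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();

Document doc = Jsoup.parse(htmlString);
// or, to get the content of some page, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]"); // find all anchors with an href attribute
for (Element el : elements) {
    URL url = new URL(el.attr("href"));
    // keep only links whose path ends in a video extension
    if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
        System.out.println("url: " + url);
        //System.out.println("file: " + url.getPath());
        System.out.println("file name: "
                + new File(url.getPath()).getName());
        System.out.println("------");
    }
}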

I'm not sure I understand the groupings in your regexp. At any rate, this one should work:

\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
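
As a quick check, the pattern above can be run against the anchor from the question. This is only a minimal sketch: the class name VideoLinkFinder and the helper findVideoUrls are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkFinder {
    // The pattern from the answer, written as a Java string literal.
    // [^\"]+? is lazy and cannot cross a double quote, so the match
    // stops at the first video extension inside the attribute value.
    private static final Pattern VIDEO_URL =
            Pattern.compile("\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b");

    // Hypothetical helper: collects every match found in the given text.
    static List<String> findVideoUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(findVideoUrls(html)); // prints [http://somevideo.flv]
    }
}
```

The link text somevideoname.avi is not reported because it does not start with http:// or https://.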

If you only want to extract href attribute values then you're better off matching against the following pattern:

href=("|')(.*?)\.(avi|flv|mp4)\1

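This matches "href=" followed by either a double-quote or a single-quote character, then lazily captures the text before the extension in group 2 and the extension itself in group 3, and finally requires the same quote character that opened the value (the backreference \1). The closing quote is matched but not captured. The href value can then be rebuilt with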

matcher.group(2) + "." + matcher.group(3)

to concatenate the file path and name with a period and then the file extension.
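
A minimal sketch of how this could look with java.util.regex, assuming the pattern is compiled once and only the first match is wanted. The class name HrefExtractor and the helper firstVideoHref are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // The href pattern from the answer as a Java string literal.
    // Group 1: opening quote, group 2: value without extension,
    // group 3: extension, \\1: the same quote that opened the value.
    private static final Pattern HREF =
            Pattern.compile("href=(\"|')(.*?)\\.(avi|flv|mp4)\\1");

    // Hypothetical helper: returns the first video href, or null if none matches.
    static String firstVideoHref(String html) {
        Matcher matcher = HREF.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(2) + "." + matcher.group(3);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(firstVideoHref(html)); // prints http://somevideo.flv
    }
}
```

Because .*? can match any character, it may in principle run past the closing quote of a non-video href on the same line; restricting it to non-quote characters would be a tighter variant, though that goes beyond what the answer proposes.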

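Your regex is greedy: \S+ takes as many non-whitespace characters as it can, so the match extends to the last video extension it can reach.

Limit its greediness by making the quantifiers, including \S+, lazy with a trailing ?:

(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)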
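
To see the effect of greediness on the sample from the question, here is a small comparison sketch. GREEDY is the question's pattern; LAZY is a variant in which both [^/]+ and \S+ carry a trailing ?, which is an assumption made for this illustration. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // The pattern from the question: every quantifier is greedy.
    static final String GREEDY = "(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)";
    // Illustrative variant: [^/]+? and \\S+? both stop as early as possible.
    static final String LAZY = "(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)";

    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        // prints http://somevideo.flv">somevideoname.avi
        System.out.println(firstMatch(GREEDY, html));
        // prints http://somevideo.flv
        System.out.println(firstMatch(LAZY, html));
    }
}
```

Making only [^/]+ lazy is not enough on this input, because a greedy \S+ still swallows the rest of the tag before backtracking to the last extension.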
