
Java regex: Match URLs with spaces and brackets

With Java regex, I am not able to match URLs that contain spaces or the brackets ( and ). Below is a code example; can you please help? Only the last URL, E.jpeg, is matched correctly.

Code:

public static void main(String[] args) {
    String content = "Lorem ipsum https://example.com/A B 123 4.pdf   https://example.com/(C.jpeg   https://example.com/D).jpeg   https://example.com/E.jpeg";
    extractUrls(content);
}

public static void extractUrls(String text) {
    Pattern pat = Pattern.compile("(https?)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pat.matcher(text);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
}

Output:

https://example.com/A
https://example.com/
https://example.com/D
https://example.com/E.jpeg

Expected output:

https://example.com/A B 123 4.pdf
https://example.com/(C.jpeg
https://example.com/D).jpeg
https://example.com/E.jpeg

Take a look at this code:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MyClass {
    public static void main(String[] args) {
        String content = "Lorem ipsum https://example.com/A B 123 4.pdf   https://example.com/(C.jpeg   https://example.com/D).jpeg   https://example.com/E.jpeg";
        extractUrls(content);
    }

    public static void extractUrls(String text) {
        Pattern pat = Pattern.compile("(https?)://(([\\S]+)(\\s)?)*", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pat.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

The output:

https://example.com/A B 123 4.pdf 
https://example.com/(C.jpeg 
https://example.com/D).jpeg 
https://example.com/E.jpeg

Explanation:

I assume a file name never contains two consecutive spaces, as in the examples.

(https?):// matches the literal prefix http:// or https:// .

The next part is a group containing two subgroups: (([\\S]+)(\\s)?) . It matches one or more non-whitespace characters, optionally followed by exactly one whitespace character.

The trailing * lets this group repeat any number of times.

As a result, a run of two or more whitespace characters ends the match, so the expression treats it as the separator between two URLs. Because the optional (\\s)? still consumes the first whitespace character of that run, every match except the last one ends with a single trailing space, which can be removed with trim().
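If the trailing space is unwanted, the same idea can be written so that a whitespace character is only consumed when more non-whitespace text follows it. The snippet below is a minimal sketch of that variant, not the code from the answer above; the class name UrlExtractor and the List return type are my own choices for illustration. It still relies on the same assumption that URLs are separated by at least two whitespace characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class UrlExtractor {
    // Scheme, a run of non-whitespace, then zero or more groups of
    // "exactly one whitespace character followed by a run of non-whitespace".
    // A whitespace character is never the last thing matched, so there is
    // no trailing space to trim.
    private static final Pattern URL =
            Pattern.compile("https?://\\S+(?:\\s\\S+)*", Pattern.CASE_INSENSITIVE);

    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String content = "Lorem ipsum https://example.com/A B 123 4.pdf   https://example.com/(C.jpeg   https://example.com/D).jpeg   https://example.com/E.jpeg";
        // Brackets make any stray whitespace around a match visible.
        for (String url : extractUrls(content)) {
            System.out.println("[" + url + "]");
        }
    }
}
```

For the sample input this prints the four expected URLs, each wrapped in brackets with no trailing space. The limitation is unchanged: a URL followed by a single space and ordinary text (for example "https://example.com/E.jpeg and more") would be matched together with that text.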

I hope it helps.

The answer from the user "The fourth bird" solved this problem. The regex should be:

http.*?\.(?:pdf|jpe?g)
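To show how this pattern could be plugged into the original program, here is a minimal sketch. The class name UrlByExtension and the List return type are assumptions made for the example. Note that inside a Java string literal the backslash has to be doubled, so \. is written as \\. there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class UrlByExtension {
    // "http", then as few characters as possible (lazy .*?), up to the first
    // occurrence of ".pdf", ".jpg" or ".jpeg". The match ends at the extension,
    // so spaces and brackets inside the URL need no special handling.
    private static final Pattern URL =
            Pattern.compile("http.*?\\.(?:pdf|jpe?g)", Pattern.CASE_INSENSITIVE);

    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String content = "Lorem ipsum https://example.com/A B 123 4.pdf   https://example.com/(C.jpeg   https://example.com/D).jpeg   https://example.com/E.jpeg";
        for (String url : extractUrls(content)) {
            System.out.println(url);
        }
    }
}
```

For the sample input this prints the four expected URLs. The trade-off is that only URLs ending in one of the listed extensions are found, and the match stops at the first such extension it meets, so the alternation has to be extended for other file types.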

