简体   繁体   中英

Extracting URLs from a text document using Java + Regular Expressions

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are listed below:

URLs that start with http:// URLs that start with www. (Missing the protocol from the front)

along with the query string parameters.

Thanks! I wish I really knew Regular expressions better.

Cheers,

If you want to make sure you are really matching a url adress and not only some word starting with 'www.' you can use the expression mentioned by DVK before. I modified it slightly and wrote a small code snippet to be a starting point for you:

import java.util.*;
import java.util.regex.*;

class FindUrls
{
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<String>();

        Pattern pattern = Pattern.compile(
            "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");

        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}

All RegEx -based code is over-engineered , especially code from the most voted answer, and here is why: it will find only valid URLs! As a sample, it will ignore anything starting with "http://" and having non-ASCII characters inside.

Even more: I have encountered 1-2-seconds processing times (single-threaded, dedicated) with Java RegEx package for very small and simple sentences, nothing specific; possibly bug in Java 6 RegEx...

Simplest/Fastest solution would be to use StringTokenizer to split text into tokens, to remove tokens starting with "http://" etc., and to concatenate tokens into text again.

If you really want to use RegEx with Java, try Automaton

This link has very good URL RegExs (they are surprisingly hard to get right, by the way - thinh http/https; port #s, valid characters, GET strings, pound signs for anchor links, etc...)

http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/

Perl has CPAN libraries that contain cannedRegExes, including for URLs. Not sure about Java though :(

This tests a certain line if it is a URL

Pattern p = Pattern.compile("http://.*|www\\..*");
Matcher m = p.matcher("http://..."); // put here the line you want to check
if(m.matches()){
    so something
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM