
Regex matching URL with www and no consecutive dots

Can you help me with a regex?

I have this line:

"Sites www.google.com и www.ridd.rdd..com good."

After processing, I want to get this line:

"Sites http://www.google.com и www.ridd.rdd..com good."

The problem is detecting consecutive dots. For sites with an error (two dots in a row), "http://" should not be prepended.

My regex:

    Matcher matchr = Pattern.compile("w{3}(\\.\\w+)+[a-z]{2,6}").matcher(text);

    while (matchr.find()) {
        text = text.replace(matchr.group(0), "http://" + matchr.group(0));
    }

    System.out.println(text);

Your regex w{3}(\\.\\w+)+[a-z]{2,6} matches part of the second, malformed "URL": the www.ridd.rdd in www.ridd.rdd..com. So you need to make sure the substring you match contains no consecutive dots. You can use word boundaries and a negative lookahead, (?!\\S*\\.{2}).
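
To see the partial match concretely, here is a small sketch that lists what the original pattern finds in the sample line. The class and method names (OriginalRegexDemo, findMatches) are made up for illustration; only the pattern and the sample text come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern and the sample text come from the question.
class OriginalRegexDemo {
    // The pattern from the question, unchanged.
    private static final Pattern ORIGINAL = Pattern.compile("w{3}(\\.\\w+)+[a-z]{2,6}");

    // Collects every substring the original pattern matches.
    static List<String> findMatches(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \u0438 is the Cyrillic letter from the sample line, escaped to keep the file ASCII-only.
        String text = "Sites www.google.com \u0438 www.ridd.rdd..com good.";
        System.out.println(findMatches(text));
        // prints [www.google.com, www.ridd.rdd]
    }
}
```

The second entry stops right before the double dot: backtracking lets [a-z]{2,6} claim the final "dd" of "rdd", and that fragment is what the replace loop then prefixes with "http://".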

Use

String text = "Sites www.google.com и www.ridd.rdd..com good.";
text = text.replaceAll("\\b(?!\\S*\\.{2})w{3}(\\.\\w+)+[a-z]{2,6}\\b", "http://$0");
// => Sites http://www.google.com и www.ridd.rdd..com good.

See the IDEONE demo

Pattern explanation:

  • \\b - leading word boundary
  • (?!\\S*\\.{2}) - negative lookahead: the non-whitespace chunk that follows must not contain two consecutive dots
  • w{3} - match www
  • (\\.\\w+)+ - 1+ sequences of . followed by 1+ alphanumeric or underscore characters
  • [a-z]{2,6} - make sure there are 2 to 6 a-z letters...
  • \\b - ...at the end of this "word"
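
As a rough sketch of how the pattern described above could be packaged for reuse, the class below compiles it once and applies it to any input. The names UrlPrefixer and addScheme are hypothetical; the pattern and the "http://$0" replacement are the ones from this answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the pattern itself is the one explained in the answer.
class UrlPrefixer {
    // Compiled once so repeated calls do not recompile the regex.
    private static final Pattern WWW_HOST =
            Pattern.compile("\\b(?!\\S*\\.{2})w{3}(\\.\\w+)+[a-z]{2,6}\\b");

    // Prepends "http://" to every www host that contains no consecutive dots.
    static String addScheme(String text) {
        // $0 in the replacement refers to the whole match.
        return WWW_HOST.matcher(text).replaceAll("http://$0");
    }

    public static void main(String[] args) {
        // \u0438 is the Cyrillic letter from the sample line, escaped to keep the file ASCII-only.
        String text = "Sites www.google.com \u0438 www.ridd.rdd..com good.";
        System.out.println(addScheme(text));
        // prints Sites http://www.google.com и www.ridd.rdd..com good.
    }
}
```

One consequence of this design: the lookahead inspects the entire whitespace-delimited chunk, so a valid host immediately followed by an ellipsis (for example "www.google.com...") would also be skipped. If that matters, the lookahead would need to be narrowed.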
