What's wrong with this regex?

Question

I am trying the following code in Java:

String test = "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf";
String regex = "[http://]{0,1}([a-zA-Z]*.)*\\.google\\.com/[-a-zA-Z/_.?&=]*";
System.out.println(test.matches(regex));

It runs for several minutes with no result (after that I killed the VM). Can anyone help me?

BTW: what would you recommend to speed up weblink-testing regexes in the future?

Answer 1

[http://] is a character class, meaning any one of those characters from the set.

Just leave those particular square brackets off if it must start with http:// . If it's optional, you can use (http://)? .
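To make the difference concrete, here is a minimal sketch comparing the two prefixes in isolation. The class and constant names are made up for illustration; only the two patterns come from the question and this answer.

```java
public class SchemePrefixDemo {
    // The question's prefix: a character class that matches at most ONE
    // character out of h, t, p, : and /
    static final String CHAR_CLASS = "[http://]{0,1}";

    // The suggested fix: an optional group matching the literal text "http://"
    static final String GROUP = "(http://)?";

    public static void main(String[] args) {
        System.out.println("h".matches(CHAR_CLASS));        // true: a single character from the set
        System.out.println("http://".matches(CHAR_CLASS));  // false: seven characters is too many
        System.out.println("http://".matches(GROUP));       // true: the whole scheme
        System.out.println("".matches(GROUP));              // true: the scheme is optional
    }
}
```

In the original regex the character class therefore consumes only the leading h, and the rest of the scheme has to be absorbed by the subdomain part.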

One obvious problem is that you're looking for the sequence ([a-zA-Z]*.)*\\.google - this will do a lot of backtracking due to that naked . which means "any character" rather than the literal period that you wanted.

But even if you replace it with what you meant, ([a-zA-Z]+\\.)*\\.google, you still have a problem - this will then require two . characters immediately before google. You should instead try:

String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

That returns immediately for me with a true match.

Keep in mind that this currently requires the / at the end of google.com . If that's a problem, it's a minor fix, but I've left it there since you had it in your original regex.
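One possible version of that minor fix, as a sketch only: wrap the slash and the path in an optional group. The class name, the constant and the sample URLs other than the one from the question are assumptions made for illustration.

```java
public class OptionalPathDemo {
    // Same regex as above, except that the "/" and everything after it are optional
    static final String REGEX =
            "(http://)?([a-zA-Z]+\\.)*google\\.com(/[-a-zA-Z/_.?&=]*)?";

    public static void main(String[] args) {
        String[] samples = {
            "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf", // URL from the question
            "http://google.com",                                    // no trailing slash
            "http://google.comx"                                    // should be rejected
        };
        for (String s : samples) {
            System.out.println(s + " -> " + s.matches(REGEX));
        }
    }
}
```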

Answer 2

You are trying to match the scheme as a character class using square brackets. That means only zero or one of the characters from that set. You want a subpattern, with parentheses. You can also change {0,1} to just say ? .

Also, you should remove the period just before google\\.com because you're already looking for a period in the subdomain subpattern of your regex. As cherouvim points out, you forgot to escape that period as well.

String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

Answer 3

In the ([a-zA-Z]*.) part you either need to escape the . (because right now it means "any character") or remove it.

Answer 4

There are two problems with the regular expression.

The first is easy, as was mentioned by others. You need to match "http://" as a subpattern, not as a character class. Change the brackets to parentheses.

The second problem causes the very poor performance. It's causing the regex to backtrack repeatedly, trying to match the pattern.

What you're trying to do is match zero or more subdomains, which are groups of letters followed by a dot. Since you want to match the dot explicitly, escape the dot. Also remove the dot in front of "google" so you can match "http://google.com/etc" (i.e., no leading dot in front of google).

So your expression becomes:

String regex = "(http://){0,1}([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

Running this regex on your example takes just a fraction of a second.
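As a quick check of the expression above, the following sketch compiles it once and reuses it for several inputs. Compiling once matters because String.matches compiles the pattern again on every call. The class, field and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class GoogleUrlPattern {
    // Compiled once and reused for every URL that is tested
    static final Pattern GOOGLE_URL = Pattern.compile(
            "(http://){0,1}([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*");

    // matches() requires the whole input to match, like String.matches
    static boolean isGoogleUrl(String url) {
        return GOOGLE_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Several subdomains, as in the question
        System.out.println(isGoogleUrl("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf"));
        // No subdomain at all
        System.out.println(isGoogleUrl("http://google.com/etc"));
        // A leading dot in front of google is rejected
        System.out.println(isGoogleUrl("http://.google.com/etc"));
    }
}
```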

Answer 5

Assuming you fix the ([a-zA-Z]*\\.) part, you need to change * to + so that it becomes ([a-zA-Z]+\\.). Otherwise you'll accept http://...google.com, which is not valid.
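A small sketch of that difference, using the rest of the regex from the other answers. The class and constant names are invented for the example.

```java
public class StarVsPlusDemo {
    // Subdomain label may be empty: "." on its own counts as a subdomain
    static final String STAR = "(http://)?([a-zA-Z]*\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
    // Subdomain label needs at least one letter
    static final String PLUS = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

    public static void main(String[] args) {
        String bogus = "http://...google.com/";
        System.out.println(bogus.matches(STAR)); // true: empty labels are accepted
        System.out.println(bogus.matches(PLUS)); // false: every label needs a letter
    }
}
```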

Answer 6

Since you group the part before google.com, I assume you are interested in the parts of the URL's host name. Regex is a powerful tool, but here you can simply use Java's URL class, which has a getHost() method. Then you can check whether the host name ends with google.com and split it, or use a simpler regex on the host name alone.

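// Requires: import java.net.URL; the constructor throws MalformedURLException
URL url = new URL("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf");
String host = url.getHost();
// Accept google.com itself or any subdomain of it;
// a bare endsWith("google.com") would also accept notgoogle.com
if (host.equals("google.com") || host.endsWith(".google.com")) {
    String[] parts = host.split("\\.");
    for (String s : parts) {
        System.out.println(s);
    }
}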
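The same idea can be wrapped in a self-contained helper, sketched below. Two things here are my own choices rather than part of the answer: it uses java.net.URI instead of URL (the URL(String) constructor is deprecated in recent JDK releases), and the class and method names are hypothetical. URI.create throws IllegalArgumentException for a malformed link.

```java
import java.net.URI;

public class GoogleHostCheck {
    // Returns true if the link's host is google.com or one of its subdomains
    static boolean isGoogleHost(String link) {
        String host = URI.create(link).getHost(); // null if the link has no host
        return host != null
                && (host.equals("google.com") || host.endsWith(".google.com"));
    }

    public static void main(String[] args) {
        String link = "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf";
        if (isGoogleHost(link)) {
            // Print each label of the host name on its own line
            for (String part : URI.create(link).getHost().split("\\.")) {
                System.out.println(part);
            }
        }
    }
}
```

No backtracking is involved, so the running time does not depend on how the path looks.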

6 answers

solution1
7 ACCEPTED 2010-11-05 08:03:57

solution2
4 2010-11-05 08:03:43

solution3
3 2010-11-05 08:05:23

solution4
2 2010-11-05 08:16:08

solution5
1 2010-11-05 08:12:19

solution6
1 2010-11-05 08:28:01