
Java regular expression to match patterns and extract them

I am trying to write a Java program that uses a regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#", I should be able to extract the strings #www.google.com# and #google.com#. Here is what I tried:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseLinks {
    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        Pattern p = Pattern.compile("#.*#");

        Matcher matcher = p.matcher(message);

        while(matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
        }       
    }
}

This prints a single match: #www.google.com# and this is another #google.com#. What I want is only the strings #www.google.com# and #google.com# extracted. What regex should I use for this?

#[^#]+#

Though thinking about it, a hash sign is a bad choice for delimiting URLs, since # is itself meaningful inside a URL (it introduces the fragment part).
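As a rough sketch of how that pattern could be dropped into the program from the question: the class name ExtractDelimited and the helper method extract are made up for illustration, and only java.util.regex calls from the original code are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDelimited {
    // '#', then one or more characters that are not '#', then the closing '#'.
    private static final Pattern DELIMITED = Pattern.compile("#[^#]+#");

    // Hypothetical helper: collects every delimited token found in the message.
    static List<String> extract(String message) {
        List<String> results = new ArrayList<>();
        Matcher matcher = DELIMITED.matcher(message);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        // Prints #www.google.com# and then #google.com#, one per line.
        for (String link : extract(message)) {
            System.out.println(link);
        }
    }
}
```

Because the character class excludes '#', the match cannot run past the first closing delimiter, so each token comes back separately.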

The reason yours does not work is the greediness of the star (from regular-expressions.info):

[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.

Java regex supports it, so use the non-greedy pattern .*? instead of the greedy .* and the match will end as soon as possible instead of as late as possible.

If the language doesn't support it, then you can approximate it by simply checking for anything that's not an ending delimiter, like so:

#[^#]*#
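To check that the two approaches agree on the string from the question, here is a small sketch; the class name ReluctantVsNegated and the helper findAll are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantVsNegated {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        // Reluctant quantifier: stop at the first '#' that lets the match succeed.
        System.out.println(findAll("#.*?#", message));
        // Negated character class: the body can never contain a '#'.
        System.out.println(findAll("#[^#]*#", message));
    }
}
```

Both calls find the same two tokens here. One difference to keep in mind: by default the dot does not match line terminators, while [^#] does, so the two patterns can diverge on multi-line input.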

Regular expressions are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to:

  • match a "#"
  • match as many characters as possible such that you can still ...
  • ... match a "#"

What you want is a "non-greedy" or "reluctant" quantifier such as "*?". Try "#.*?#" in your case.
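The steps above can be made visible by looking at where the first match starts and ends. This is only a sketch; the class name GreedySpan and the helper firstSpan are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedySpan {
    // Hypothetical helper: {start, end} of the first match, or null if there is none.
    static int[] firstSpan(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? new int[] {matcher.start(), matcher.end()} : null;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";

        // Greedy: runs from the first '#' all the way to the last '#'.
        int[] greedy = firstSpan("#.*#", message);
        System.out.println(message.substring(greedy[0], greedy[1]));

        // Reluctant: runs from the first '#' to the nearest following '#'.
        int[] reluctant = firstSpan("#.*?#", message);
        System.out.println(message.substring(reluctant[0], reluctant[1]));
    }
}
```

The greedy span covers both tokens and the text between them, which is the single long match seen in the question; the reluctant span covers only the first token.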
