complex regular expression in Java

I have a problem that seems rather complex to me, and I'm using regular expressions in Java to solve it:

I receive text strings that must have the format:

M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>

I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:

String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";

That works fine if <either a url or string> is just an alphanumeric string, but it falls apart when the embedded string is a URL of the format:

tcp://someurl.something:port

Can anyone help me adjust the above regular expression so that the text extracted after :D: can be either a URL or an alphanumeric string?

Here's an example:

public static void main(String[] args) {
    String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
    boolean matchFound = false;
    ArrayList<String> values = new ArrayList<>();
    String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
    Matcher m3 = Pattern.compile(pattern2).matcher(name);

    while (m3.find()) {
        matchFound = true;
        String m = m3.group(2);
        System.out.println("regex found match:  " + m);
        values.add(m);
    }

}

For the above example, the results I want are:

myString1
tcp://someurl.com:8989
myString2
1

Note that the strings can be of variable length and are alphanumeric, but some other characters must be allowed (such as the URL format with :// and/or the . and - characters).

You mention that the format is constant:

M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>

Capture groups can do this for you with the pattern:

"M:(.*):D:(.*):C:(.*):Q:(.*)"

Or you can use String.split() with the pattern "M:|:D:|:C:|:Q:". However, because the input starts with M:, the split returns an empty element at index 0; the four values follow at indices 1 to 4.

public static void main(String[] args) throws Exception {
    System.out.println("Regex: ");
    String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
    Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
    if (matcher.matches()) {
        for (int i = 1; i <= matcher.groupCount(); i++) {
            System.out.println(matcher.group(i));
        }
    }
    System.out.println();

    System.out.println("String.split(): ");
    String[] pieces = data.split("M:|:D:|:C:|:Q:");
    for (String piece : pieces) {
        System.out.println(piece);
    }
}

Results:

Regex: 
<some text>
tcp://someurl.something:port
<some more text>
<a number>

String.split(): 

<some text>
tcp://someurl.something:port
<some more text>
<a number>
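
Applied to the sample input from the question, the capture-group pattern can be wrapped in a small helper. This is only a sketch: the class name MessageParser and the method parse are made up for illustration, and it relies on each of :D:, :C: and :Q: appearing exactly once in the input, since the greedy .* groups would otherwise split at the last occurrence of each marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessageParser {
    // Compiled once; groups 1 to 4 are the M, D, C and Q values.
    private static final Pattern FORMAT =
            Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)");

    // Hypothetical helper: returns the four values in order,
    // or an empty list if the input does not have the expected format.
    static List<String> parse(String input) {
        List<String> values = new ArrayList<>();
        Matcher matcher = FORMAT.matcher(input);
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                values.add(matcher.group(i));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
        for (String value : parse(name)) {
            System.out.println(value);
        }
    }
}
```

Because matches() requires the whole input to fit the pattern, a string that lacks one of the markers yields an empty list rather than a partial result.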

To extract the URL/text part you don't need a regular expression. Use indexOf and substring:

int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);
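
The snippet above assumes both markers are present; indexOf returns -1 when a marker is missing, which would make substring throw or return the wrong span. A minimal sketch with those checks added follows. The class name UrlOrTextExtractor and the method extractD are hypothetical, and returning null for malformed input is just one possible choice.

```java
public class UrlOrTextExtractor {
    // Hypothetical helper: returns the text between ":D:" and the next ":C:",
    // or null if either marker is missing.
    static String extractD(String input) {
        int marker = input.indexOf(":D:");
        if (marker < 0) {
            return null; // no ":D:" marker
        }
        int startPos = marker + ":D:".length();
        int endPos = input.indexOf(":C:", startPos);
        if (endPos < 0) {
            return null; // no ":C:" marker after ":D:"
        }
        return input.substring(startPos, endPos);
    }

    public static void main(String[] args) {
        String input = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
        System.out.println(extractD(input));
    }
}
```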

Assuming you need to do some validation along with the parsing, break the regex into separate parts like this:

    String m_regex = "[\\w.]+"; // in Java a . inside [] is just a plain dot
    String url_regex = ".";     // placeholder only: there are many URL patterns online, pick your favorite
    String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // a URL or a sequence of alphanumeric characters
    String c_regex = "[\\w.]+"; // I'm assuming you may want this to be more restrictive; not sure
    String q_regex = "\\d+";    // what sort of number exactly? assuming any string of digits here

    // Each named group needs a unique name; Java rejects duplicate group names.
    String regex = "M:(?<M>" + m_regex + "):"
                 + "D:(?<D>" + d_regex + "):"
                 + "C:(?<C>" + c_regex + "):"
                 + "Q:(?<Q>" + q_regex + ")";
    Pattern p = Pattern.compile(regex);
    Pattern p = Pattern.compile(regex);

It might be a good idea to keep the compiled pattern in a static field and build it in a static initializer block, so that the temporary regex strings don't clutter the class as otherwise useless fields.

Then you can retrieve each part by its name:

    Matcher m = p.matcher( input );
    if (m.matches()) {
        String m_part = m.group( "M" );
        ...
        String q_part = m.group( "Q" );
    }
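
Put together, the named-group approach might look like the sketch below. The URL pattern is an assumption made only so the example runs: it accepts scheme://host with an optional :port and is not a general URL validator. The class name NamedGroupParser and the method parse are likewise hypothetical. The group names are M, D, C and Q, and the pattern is built once in a static initializer so the temporary strings stay local.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupParser {
    private static final Pattern FORMAT;

    static {
        String mRegex = "[\\w.]+";
        // Assumed, deliberately simple URL shape: scheme://host[:port]
        String urlRegex = "\\w+://[\\w.-]+(?::\\d+)?";
        String dRegex = "(?:" + urlRegex + "|\\p{Alnum}+)";
        String cRegex = "[\\w.]+";
        String qRegex = "\\d+";
        FORMAT = Pattern.compile(
                "M:(?<M>" + mRegex + "):"
              + "D:(?<D>" + dRegex + "):"
              + "C:(?<C>" + cRegex + "):"
              + "Q:(?<Q>" + qRegex + ")");
    }

    // Hypothetical helper: returns the parts keyed by group name,
    // or null if the input fails validation.
    static Map<String, String> parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"M", "D", "C", "Q"}) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1"));
        System.out.println(parse("M:myString1:D:plainText42:C:myString2:Q:1"));
    }
}
```

Unlike the plain (.*) groups, this version rejects input whose parts contain unexpected characters, for example a non-numeric Q value.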

You can go a step further and create a RegexGroup interface whose implementing objects each represent one part of the regex, with a name and the actual pattern. You definitely lose simplicity that way, though, and the code becomes harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)
