
Regex to match only letters and numbers

Can you help with this code?

It seems easy, but always fails.

@Test
public void normalizeString(){
    StringBuilder ret =  new StringBuilder();
    //Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
    Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
    for (int i = 1; i < matches.groupCount(); i++)
        ret.append(matches.group(i));

    assertEquals("P12345678P", ret.toString());
}

Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:

@Test
public void normalizeString(){
    StringBuilder ret =  new StringBuilder();
    Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");

    while (matches.find()) {
        ret.append(matches.group());
    }

    assertEquals("P12345678P", ret.toString());
}

Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
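To make the anchored versus unanchored distinction concrete, here is a minimal sketch comparing matches(), lookingAt() and find() on the input from the question. The class MatchModes and the helper describe are names made up for this illustration; only the Pattern and Matcher calls come from java.util.regex.

```java
import java.util.regex.Pattern;

public class MatchModes {
    // Hypothetical helper: reports "matches,lookingAt,find" for one regex/input pair.
    static String describe(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean whole = p.matcher(input).matches();     // the entire input must match
        boolean prefix = p.matcher(input).lookingAt();  // the match must start at the beginning
        boolean anywhere = p.matcher(input).find();     // the match may start anywhere
        return whole + "," + prefix + "," + anywhere;
    }

    public static void main(String[] args) {
        System.out.println(describe("[A-Z0-9]+", "P-12345678-P")); // false,true,true
        System.out.println(describe("[0-9]+", "P-12345678-P"));    // false,false,true
    }
}
```

Only find() scans forward for the next match, which is why it is the one to call in a loop when you want to collect every match.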

Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me to suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1, because it reports the number of capturing groups in the pattern, not the number of matches found.
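A small sketch of that point: groupCount() can be called before any match is attempted, and its result depends only on the pattern. The class GroupCountDemo and the helper groupsIn are hypothetical names for this example.

```java
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Hypothetical helper: number of capturing groups in a regex.
    // The input is empty and no match is attempted; the count comes from the pattern alone.
    static int groupsIn(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupsIn("([\\w])"));     // 1
        System.out.println(groupsIn("[A-Z0-9]+"));   // 0
        System.out.println(groupsIn("(\\w(\\w))c")); // 2
    }
}
```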

First of all, you don't need to add a group, because the entire match can always be accessed as group 0. So instead of

  • (regex) and group(1)

you can use

  • regex and group(0)

The next thing is that \\w is already a character class, so you don't need to surround it with another [ ]. Doing so is similar to writing [[a-z]], which is the same as [a-z].
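If you want to check this yourself, the sketch below collects every match for a given regex and shows that the bracketed and unbracketed forms give the same result. NestedClassDemo and collect are names invented for this example. Keep in mind that \\w also matches lowercase letters and the underscore, so it is broader than [A-Z0-9].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedClassDemo {
    // Hypothetical helper: concatenates every match of regex found in input.
    static String collect(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect("[\\w]", "P-12345678-P")); // P12345678P
        System.out.println(collect("\\w", "P-12345678-P"));   // P12345678P
        System.out.println(collect("[[a-z]]", "aZ-b"));       // ab
        System.out.println(collect("[a-z]", "aZ-b"));         // ab
    }
}
```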

Now in your

for (int i = 1; i < matches.groupCount(); i++)
    ret.append(matches.group(i));

you iterate over the groups starting from 1 but skip the last one: groups are indexed from 1 to n, so the condition i < n never reaches n. You would need to use i <= matches.groupCount() instead.

It also looks like you are confusing two things. This loop does not find all matches of the regex in the input. A loop like this is used to iterate over the groups of the regex after a match has been found.

So if the regex were something like (\\w(\\w))c and the match were abc, then

for (int i = 1; i <= matches.groupCount(); i++)
    System.out.println(matches.group(i));

would print

ab
b

because

  • the first group, (\\w(\\w)), contains the two characters before c
  • the second group is the one nested inside the first, and contains the character right after the first one.

But to print them, you first need to let the regex engine scan your input with find(), or check whether the entire input matches() the regex. Otherwise you get an IllegalStateException, because the engine can't know which match you want the groups from (there can be many matches of the regex in the input).
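Here is a minimal sketch of both halves of that explanation: reading the groups after a successful find(), and the IllegalStateException you get when group(1) is called before any match attempt. The class GroupIteration and its two helper methods are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIteration {
    // Hypothetical helper: groups 1..n of the first match, or an empty list if nothing matches.
    static List<String> groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {                                  // a match must be found first
            for (int i = 1; i <= m.groupCount(); i++) {  // note <=, so the last group is included
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    // Hypothetical helper: true if group(1) throws when no match has been attempted yet.
    static boolean groupBeforeMatchThrows(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group(1);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(groupsOfFirstMatch("(\\w(\\w))c", "abc"));     // [ab, b]
        System.out.println(groupBeforeMatchThrows("(\\w(\\w))c", "abc")); // true
    }
}
```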

So what you may want to use is something like

StringBuilder ret =  new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
    ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());

Another approach (and probably a simpler one) would be to remove all the characters you don't want from your input. You could just use replaceAll with a negated character class [^...], like

String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");

which produces a new string with every character that is not in A-Z0-9 removed (replaced with "").
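If you normalize many strings, one possible variation is to compile the pattern once and reuse it, since String.replaceAll compiles the regex on every call. This is a sketch under that assumption. The class Normalize, the constant UNWANTED and the method normalize are names made up here.

```java
import java.util.regex.Pattern;

public class Normalize {
    // Assumption: only uppercase A-Z and digits should survive, as in the question's example.
    private static final Pattern UNWANTED = Pattern.compile("[^A-Z0-9]+");

    // Hypothetical helper: strips every run of characters outside A-Z0-9.
    static String normalize(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("P-12345678-P")); // P12345678P
        System.out.println(normalize("p-1"));          // 1 (lowercase letters are removed too)
    }
}
```

If lowercase letters should be kept as well, the class would need to become something like [^A-Za-z0-9]+.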


 