Capturing group eats some characters

My input looks like this:

line 1
ER1.RIAA.SOMPSFIO(LIAOEE)         UTGD788  FDSJOFUZZÄ

line 2
JNDJZSDS ER1.RIAA.SIMEDFUA(AUDD)                YIRIHFIH1465EZZÄ

line 3
UJZRJOERERÃLDE,UIE='UJ1.DHZKZ5.OZDEZN98.AAERRE',I=DZEDE                   POPZEOE

I would like to extract only the period-separated names, i.e.:

ER1.RIAA.SOMPSFIO
ER1.RIAA.SIMEDFUA
UJ1.DHZKZ5.OZDEZN98.AAERRE

My current attempt is:

try {
    StringBuilder sb = new StringBuilder();
    String line = br.readLine();

    while (line != null) {
        nrligne++;

        int counter = 0;

        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '.') {
                counter++;
            }
        }

        if (counter == 2) {

            if (line.matches("^.*[A-Z0-9]+\\..[A-Z1-9]+.*$")) {

                line = removeTroublesomeCharacters(line);
                System.out.println("ligne vaut " + line);

                Pattern dsnPattern = Pattern.compile("^.*([A-Z0-9]+)\\..([A-Z1-9]+)\\..([A-Z1-9]+).*$");
                Matcher m = dsnPattern.matcher(line);

                if (m.matches()) {
                    String part1 = m.group(1);
                    String part2 = m.group(2);
                    String part3 = m.group(3);

                    System.out.println("part1 vaut " + part1);
                    System.out.println("part2 vaut " + part2);
                    System.out.println("part2 vaut " + part3);
                }
            }

For the moment, the output is:

ligne vaut ER1.RIAA.SOMPSFIO(LIAOEE)                                                                                             UTGD788
part1 vaut 1
part2 vaut IAA
part2 vaut OMPSFIO
ligne vaut PZFDSJOFUZZÃâ                                                                                                                                                                                    ER1.RIAA.SIMEDFUA(AUDD)                                                                                             UOOO88
part1 vaut 1
part2 vaut IAA
part2 vaut IMEDFUA
ligne vaut UJZRJOERERÃLDE,UIE='UJ1.DHZKZ5.OZDEZN98',I=DZEDE                                                                                                                                                                                                                                                                      POPZEOE
part1 vaut 1
part2 vaut HZKZ5
part2 vaut ZDEZN98

Input file: http://uploadhero.co/dl/PWBLhi7d

I don't understand why the regex eats the beginning of each captured part. Can someone help me fix this?

Because you are consuming an extra character after the dot, and that character is not inside the capturing group that follows.

\\..   // this matches a dot, and then any single character after it
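The effect of the stray second dot can be checked in isolation. The following is only a small sketch: the class name ExtraDotDemo and the helper firstGroup are made up for illustration, and the input is the first name from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtraDotDemo {
    // Hypothetical helper: returns what group 1 of `regex` captures
    // at the first match in `input`, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "ER1.RIAA.SOMPSFIO";

        // "\\.." is a literal dot followed by ANY one character,
        // so the 'R' is consumed before the group starts.
        System.out.println(firstGroup("\\..([A-Z1-9]+)", input)); // IAA

        // "\\." is a literal dot only, so the group starts at 'R'.
        System.out.println(firstGroup("\\.([A-Z1-9]+)", input));  // RIAA
    }
}
```

This matches the "part2 vaut IAA" line in the question's output: the R was eaten by the unescaped dot.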

Also, change the .* at the beginning of your regex to .*? . Quantifiers are greedy by default, so .* consumes as many characters as it can and leaves only a single character before the first . to be matched by ([A-Z0-9]+) . That is why part1 comes out as 1 .

Change your regex to:

"^.*?([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$"
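To compare the greedy and the reluctant prefix side by side, something like the following sketch can be used. The class name GreedyPrefixDemo and the helper parts are assumptions for illustration; both patterns already use the corrected single \\. between the groups, so the only difference is .* versus .*? at the start.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPrefixDemo {
    static final String GREEDY =
            "^.*([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$";
    static final String RELUCTANT =
            "^.*?([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$";

    // Hypothetical helper: joins the three captured groups with dots,
    // or returns null if the whole line does not match.
    static String parts(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "." + m.group(2) + "." + m.group(3);
    }

    public static void main(String[] args) {
        String line = "ER1.RIAA.SOMPSFIO(LIAOEE)";

        // Greedy prefix: .* swallows "ER" and leaves only "1" for group 1.
        System.out.println(parts(GREEDY, line));    // 1.RIAA.SOMPSFIO

        // Reluctant prefix: .*? matches as little as possible.
        System.out.println(parts(RELUCTANT, line)); // ER1.RIAA.SOMPSFIO
    }
}
```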

Also, since you are using Pattern and Matcher anyway, I would consider using the Matcher#find() method and building a pattern for just the part you need:

Pattern dsnPattern = Pattern.compile("([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+)");
Matcher m = dsnPattern.matcher(line);

if (m.find()) {
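As a fuller illustration of the find() approach, here is a minimal self-contained sketch. The class name DsnFinder, the helper findAll, and the shortened sample lines are assumptions, not part of the original code. THREE_PARTS is the pattern shown above; ANY_PARTS is a hypothetical variant, not from the answer, that accepts two or more dot-separated parts, since one of the names in the question has four.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DsnFinder {
    // The three-part pattern, without any surrounding .* fillers.
    static final Pattern THREE_PARTS =
            Pattern.compile("([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+)");

    // Hypothetical variant: two or more dot-separated parts.
    static final Pattern ANY_PARTS =
            Pattern.compile("[A-Z0-9]+(?:\\.[A-Z0-9]+)+");

    // Collects every non-overlapping match of `pattern` in `line`.
    static List<String> findAll(Pattern pattern, String line) {
        List<String> names = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String line2 = "JNDJZSDS ER1.RIAA.SIMEDFUA(AUDD)   YIRIHFIH1465EZZ";
        String line3 = "UIE='UJ1.DHZKZ5.OZDEZN98.AAERRE',I=DZEDE";

        System.out.println(findAll(THREE_PARTS, line2)); // [ER1.RIAA.SIMEDFUA]
        System.out.println(findAll(THREE_PARTS, line3)); // [UJ1.DHZKZ5.OZDEZN98]
        System.out.println(findAll(ANY_PARTS, line3));   // [UJ1.DHZKZ5.OZDEZN98.AAERRE]
    }
}
```

Because find() scans for the pattern anywhere in the line, no ^.* or .*$ is needed, and the greedy-prefix problem does not arise.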

Since the capturing parts of your expression are preceded and followed by "eat anything" .* expressions, part of what you wish to capture ends up being consumed by these "fillers".

You can explicitly require that the characters before and after [A-Z0-9]+ groups be non-alphanumeric, like this:

   "^.*(?<![A-Z0-9])([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+)(?![A-Z1-9]).*$"
  • The (?<![A-Z0-9]) expression means "not preceded by [A-Z0-9] "
  • The (?![A-Z1-9]) expression means "not followed by [A-Z1-9] "

EDIT:

The lookahead is not necessary, because + is greedy:

"^.*(?<![A-Z0-9])([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$"

(credit for this goes to Rohit Jain )
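The lookbehind version can be tried out with a short sketch like this one. The class name LookbehindDemo and the helper parts are made up for illustration; the pattern is the one from the EDIT above, still with a greedy .* prefix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Greedy .* prefix, but group 1 may not start in the middle of an
    // alphanumeric run because of the negative lookbehind.
    static final Pattern DSN = Pattern.compile(
            "^.*(?<![A-Z0-9])([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$");

    // Hypothetical helper: joins the three captured groups with dots,
    // or returns null if the whole line does not match.
    static String parts(String line) {
        Matcher m = DSN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "." + m.group(2) + "." + m.group(3);
    }

    public static void main(String[] args) {
        // At the very start of the line nothing precedes group 1,
        // so the negative lookbehind succeeds there as well.
        System.out.println(parts("ER1.RIAA.SOMPSFIO(LIAOEE)   UTGD788")); // ER1.RIAA.SOMPSFIO

        // The space before "ER1" satisfies the lookbehind; starting at
        // "R1" or "1" would not, so the greedy prefix stops early enough.
        System.out.println(parts("JNDJZSDS ER1.RIAA.SIMEDFUA(AUDD)"));    // ER1.RIAA.SIMEDFUA
    }
}
```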
