
C language regex: matching multiple parts of a string

I have a C program where I can't get regular expression matching to work the way I want. Basically, I want to match the first character (W or M) in testStr as the first match and the name of the log file (TESTY.LOG) as the second match. Here is what I have so far:

#include    <stdio.h>
#include    <stdlib.h>
#include    <string.h>
#include    <regex.h>
#define     MAX_MATCHES 2
.....
char testStr[20]="W TESTY.LOG ";
char temp[100];
int reti;
regex_t regex;
regmatch_t matches[MAX_MATCHES];
int i;
int numchars;

/* Compile regular expression */
reti = regcomp(&regex, "^([W|M])[[:space:]]([A-Z|0-9|\.]{1,})[[:space:]]*$", REG_EXTENDED);
/* Execute regular expression */
reti = regexec(&regex, testStr, MAX_MATCHES, matches, 0);
if (!reti) {
  for (i=0; i < MAX_MATCHES; i++) {
    numchars = (int)matches[i].rm_eo - (int)matches[i].rm_so;
    strncpy(temp,testStr+matches[i].rm_so,numchars);
    temp[numchars] = '\0';
  }
}

When I run this in gdb, I see the following for matches:

(gdb) display matches
1: matches = {{rm_so = 0, rm_eo = 15}, {rm_so = 0, rm_eo = 1}}

2: temp = "W TESTY.LOG"

and

2: temp = "W"

So, I'm getting the first char OK, but I am not getting just the log file name for the second match. I use regex in Perl, but I'm new to regex in ANSI C. I feel like I'm missing something basic here.

Match 0 is the part of the string matched by the entire regex (Perl's $& ). Match i for i > 0 is the part of the match corresponding to capture number i , the same as Perl's $1, $2, … . You have two captures, so you should expect three matches. But you define MAX_MATCHES as 2 and pass it to regexec as the size of the matches array, so the last match (the log file name) is discarded.
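As a minimal sketch of that slot layout (the helper name match_offsets and the constant NUM_SLOTS are made up for illustration, not part of the question's code), the array can be sized as the number of captures plus one. After regcomp succeeds, the re_nsub member of regex_t holds the number of parenthesized groups, which gives a way to check that the array is large enough:

```c
#include <regex.h>
#include <stddef.h>

/* Slot 0 holds the whole match; slots 1 and 2 hold the two capture groups. */
#define NUM_SLOTS 3

/* Hypothetical helper: runs `pattern` against `subject` and stores the
   offsets of the whole match and of each group in `out`.
   Returns 0 on a match, nonzero on no match or on any error. */
static int match_offsets(const char *pattern, const char *subject,
                         regmatch_t out[NUM_SLOTS])
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    /* re_nsub is the number of parenthesized groups regcomp() found.
       Refuse patterns whose groups would not all fit in `out`. */
    if (re.re_nsub + 1 > NUM_SLOTS) {
        regfree(&re);
        return -1;
    }

    rc = regexec(&re, subject, NUM_SLOTS, out, 0);
    regfree(&re);
    return rc;
}
```

Called with the question's pattern and the subject "W TESTY.LOG ", out[1] should cover offsets 0 to 1 (the W) and out[2] offsets 2 to 11 (TESTY.LOG), while out[0] spans the whole string.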


Also, the regular expression

^([W|M])[[:space:]]([A-Z|0-9|\.]{1,})[[:space:]]*$

is a little odd. I think you should reread the documentation about character classes in regular expressions -- in this case, it is the same in Perl as it is in Posix extended REs. Inside square brackets, | is an ordinary character rather than an alternation operator, so [W|M] matches any of the three characters W , | or M . Similarly, [A-Z|0-9|\.]{1,} matches one or more of an uppercase letter, a digit, the character | or the character . .

The backslash is irrelevant here, since it only escapes the . within the C string literal, where no escape is needed; the compiler drops it before the regex library ever sees the pattern. If you had compiled with warnings enabled, -Wall , your C compiler would probably have warned you that \. is not a valid escape sequence. If you had actually passed the backslash on to the regex library (by writing \\. in the source), it would have been interpreted as another possible match for the character class, because a backslash is not an escape character inside a Posix bracket expression.
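These two points about bracket expressions can be checked with a small test helper. This is only a sketch: ere_matches is a hypothetical name, and the backslash behaviour described assumes an implementation that follows the Posix rules for bracket expressions.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `subject` matches the extended regular
   expression `pattern`, 0 if it does not, and -1 if `pattern` fails to
   compile. REG_NOSUB is used because only a yes/no answer is needed. */
static int ere_matches(const char *pattern, const char *subject)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, ere_matches("^[W|M]$", "|") should return 1 while ere_matches("^[WM]$", "|") should return 0. Likewise ere_matches("^[\\.]$", "\\"), where the doubled backslash in the C source really does put a backslash into the pattern, should return 1, because the backslash is then just one more member of the class.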

Also, {1,} can be conveniently written as + , both in Perl and in Posix Extended REs.

In short, what you probably wanted was:

reti = regcomp(&regex, "^([WM])[[:space:]]([A-Z0-9.]+)[[:space:]]*$", REG_EXTENDED);

You could also use

reti = regcomp(&regex, "^([WM])[[:space:]]([[:alnum:].]+)[[:space:]]*$", REG_EXTENDED);
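Putting the pieces together, here is one way the extraction could look with three match slots and bounds-checked copies. It is a sketch rather than a drop-in fix: parse_line, copy_group and the buffer parameters are names invented for this example, and it uses the first of the two patterns above.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Whole match plus the two capture groups. */
#define NUM_SLOTS 3

/* Hypothetical helper: copies the text of one match slot into `out`.
   Returns 0 on success, -1 if the group did not participate in the
   match or its text does not fit in `out`. */
static int copy_group(const char *subject, const regmatch_t *slot,
                      char *out, size_t outsz)
{
    if (slot->rm_so < 0)
        return -1;
    size_t len = (size_t)(slot->rm_eo - slot->rm_so);
    if (len >= outsz)
        return -1;
    memcpy(out, subject + slot->rm_so, len);
    out[len] = '\0';
    return 0;
}

/* Hypothetical helper: splits a line such as "W TESTY.LOG " into its
   one-letter kind (W or M) and the log file name.
   Returns 0 on success, -1 on no match or any error. */
static int parse_line(const char *line, char *kind, size_t kindsz,
                      char *name, size_t namesz)
{
    regex_t re;
    regmatch_t m[NUM_SLOTS];
    int rc = -1;

    if (regcomp(&re, "^([WM])[[:space:]]([A-Z0-9.]+)[[:space:]]*$",
                REG_EXTENDED) != 0)
        return -1;

    /* m[0] is the whole match, m[1] the letter, m[2] the file name. */
    if (regexec(&re, line, NUM_SLOTS, m, 0) == 0 &&
        copy_group(line, &m[1], kind, kindsz) == 0 &&
        copy_group(line, &m[2], name, namesz) == 0)
        rc = 0;

    regfree(&re);
    return rc;
}
```

For the question's test string, parse_line("W TESTY.LOG ", kind, sizeof kind, name, sizeof name) should leave "W" in kind and "TESTY.LOG" in name. Copying by explicit length with a size check avoids the unchecked strncpy into temp in the original loop, and regfree releases the compiled pattern, which the original snippet never does.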


 