
Flex/Lex: Regular Expression matches double characters

I have a flex program written in C++ that needs to enforce the following rule:

I want yytext to accept a shape name followed by corner letters, where:
○ Each of the characters ABCDEFGH appears zero times or once

For example - input:
"triangle ABC" is a valid shape and I want the program to print "Valid shape"
"triangle AAC" is not a valid shape because it contains a double A and I want the program to print nothing in this case
"triangle ABCD" is not a valid shape because it contains four letters and I want the program to print nothing in this case too.

Below is my code, with the regular expressions I have tried so far:

%{
    /** Methods and Variables initialization **/
   
%}

corner corner" "[A-H]
line line" "[A-H]{2}
triangle triangle" "[A-H]{3}
square rectangle" "[A-H]{4}
poly pentagon" "[A-H]{5}
hexa hexagon" "[A-H]{6}
hepta heptagon" "[A-H]{7}
octa octagon" "[A-H]{8}

/** Below is the rule section -- yytext is the matched string returned to the program **/
%%
{corner} |
{line} |
{triangle} |  
{square}  |
{poly} |
{hexa} |
{hepta} | 
{octa} {   
     printf("Valid shape: %s", yytext);
}
.
%%

int main() {
    yylex();    
    return 0;
}

// yywrap() - called at end of input; returning 1 tells the scanner there is no more input
int yywrap(void)
{
   return 1;
}


The current input:
triangle AAC
The current output:
Valid shape: triangle AAC (We don't want that)

The current input:
triangle ABCD
The current output:
Valid shape: triangle ABC

This is not the sort of problem for which you would typically use (f)lex. The base lexical analysis is trivial (it could be done by simply splitting the line at the space), and detailed error analysis is a bit outside of (f)lex's comfort zone, specifically because there is no direct way to match "a string containing the same character twice" using a (f)lex regular expression.
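For comparison, here is a minimal sketch in plain C of the "split the line at the space" approach. The helper name check_shape, the error strings, and the assumption that the trailing newline has already been stripped are all illustrative; the shape names and letter counts are taken from the definitions in the question.

```c
#include <string.h>

/* Shape names and the number of corner letters each one requires
   (taken from the definitions in the question). */
static const struct { const char *name; int corners; } shapes[] = {
    {"corner", 1},   {"line", 2},    {"triangle", 3}, {"rectangle", 4},
    {"pentagon", 5}, {"hexagon", 6}, {"heptagon", 7}, {"octagon", 8},
};

/* Hypothetical helper: returns NULL if `line` is a valid shape,
   otherwise a short error message. Assumes no trailing newline. */
const char *check_shape(const char *line) {
    const char *space = strchr(line, ' ');
    if (space == NULL) return "Missing letters";
    size_t name_len = (size_t)(space - line);

    /* Look up the shape name to find the required letter count. */
    int corners = 0;
    for (size_t i = 0; i < sizeof shapes / sizeof shapes[0]; i++) {
        if (strlen(shapes[i].name) == name_len &&
            strncmp(shapes[i].name, line, name_len) == 0) {
            corners = shapes[i].corners;
        }
    }
    if (corners == 0) return "Invalid shape name";

    /* One bit per letter A..H is enough to detect duplicates. */
    unsigned seen = 0;
    int count = 0;
    for (const char *p = space + 1; *p != '\0'; p++, count++) {
        if (*p < 'A' || *p > 'H') return "Invalid letter";
        unsigned bit = 1u << (*p - 'A');
        if (seen & bit) return "Duplicate letter";
        seen |= bit;
    }
    if (count > corners) return "Too long";
    if (count < corners) return "Too short";
    return NULL;
}
```

Calling check_shape on each input line and printing "Valid shape" when it returns NULL would give the behaviour asked for in the question, without a scanner generator.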

Still, as shown by the question asked by one of your classmates, it can be done with (f)lex by taking advantage of the scanner's ordering rules:

  1. Always use the longest possible match.
  2. If two or more rules would qualify, choose the first one.

That doesn't get around the question of duplicate characters. The only way to solve that is to enumerate all possibilities, of which there are eight in this case (one per letter). A simpler way of doing that than the one proposed in the linked question is [A-H]*A[A-H]*A[A-H]*|[A-H]*B[A-H]*B[A-H]*|[A-H]*C[A-H]*C[A-H]*... .
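Since writing out all eight alternatives by hand is tedious, one option is to generate the pattern. The following C sketch builds the full alternation into a buffer; build_dups_pattern is a hypothetical helper introduced here for illustration, not something (f)lex provides.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the duplicate-letter pattern for the
   letters first..last into buf, one alternative per letter, e.g.
   [A-H]*A[A-H]*A[A-H]*|[A-H]*B[A-H]*B[A-H]*|...
   Returns the pattern length, or -1 if buf is too small. */
int build_dups_pattern(char first, char last, char *buf, size_t size) {
    size_t used = 0;
    if (size == 0) return -1;
    buf[0] = '\0';
    for (char c = first; c <= last; c++) {
        /* Each alternative matches any string in which c occurs twice. */
        int n = snprintf(buf + used, size - used,
                         "%s[%c-%c]*%c[%c-%c]*%c[%c-%c]*",
                         c == first ? "" : "|",
                         first, last, c, first, last, c, first, last);
        if (n < 0 || (size_t)n >= size - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Printing the result of build_dups_pattern('A', 'H', buf, sizeof buf) gives a string that could be pasted into the definitions section of the scanner as the body of a dups definition.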

That lets you create an ordered set of rules, something like this:

  1. Match lines with duplicate characters
  2. Match lines with too many characters
  3. Match lines with exactly the right number of characters
  4. Anything else is an error. (Too few characters, invalid shape name, invalid letter, etc.)

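So the rules might look like this (leaving out the definitions of the two macros, {dups} and {valid}, which are straightforward but tedious to write; err is assumed to be an error-reporting function that you supply):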

  /* 1. Dups */
[a-z]+\ {dups}$  { err("Duplicate letter"); }
  /* 2. Too long */
{valid}[A-H]+$   { err("Too long"); }
  /* 3. Just right */
{valid}$         { printf("Valid: %s\n", yytext); }
  /* 4. Anything else */
.+               { err("Too short or invalid character"); }
  /* Ignore newlines */
\n               ;
