
How to achieve capturing groups in flex lex?

I want to match a string that starts with a '#' and then continues up to the next occurrence of the character that immediately follows the '#'. This can be achieved using capturing groups like this: #(.)[^(?1)]*(?1) (EDIT: this regex is also erroneous). This matches #$foo$, does not match #%bar&, and matches the first 6 characters of #"foo"bar.

But since flex lex does not support capturing groups, what is the workaround here?

As you say, (f)lex does not support capturing groups, and it certainly doesn't support backreferences.

So there is no simple workaround, but there are workarounds. Here are a few possibilities:

  1. You can read the input one character at a time using the input() function, until you find the matching character (but you have to create your own buffer to store the characters, because characters read by input() are not added to the current token). This is not the most efficient because reading one character at a time is a bit clunky, but it's the only interface that (f)lex offers. (The following snippet assumes you have some kind of expandable StringBuilder; if you are using C++, this would just be replaced with a std::string.)

     #.  {
           StringBuilder sb = string_builder_new();
           int delim = yytext[1];          /* the character right after '#' */
           for (;;) {
             int next = input();           /* not added to yytext */
             if (next == delim) break;
             if (next == EOF) { /* Signal error */; break; }
             string_builder_addchar(&sb, next);
           }
           yylval = string_builder_release(&sb);
           return DELIMITED_STRING;
         }
  2. Even less efficiently, but perhaps more conveniently, you can get (f)lex to accumulate the characters in yytext using yymore(), matching one character at a time in a start condition:

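     %x DELIMITED
     %%
       int delim;
     #.                 { delim = yytext[1]; BEGIN(DELIMITED); }
     <DELIMITED>.|\n    {
                          /* yymore() appends each match to yytext,
                             so the newest character is the last one */
                          if (yytext[yyleng - 1] == delim) {
                            /* drop the closing delimiter */
                            yylval = strndup(yytext, yyleng - 1);
                            BEGIN(INITIAL);
                            return DELIMITED_STRING;
                          }
                          yymore();
                        }
     <DELIMITED><<EOF>> { /* Signal unterminated string error */ }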
  3. The most efficient solution (in (f)lex) is to just write one rule for each possible delimiter. While that's a lot of rules, they could be easily generated with a small script in whatever scripting language you prefer. And, actually, there are not that many rules, particularly if you don't allow alphabetic and non-printing characters to be delimiters. This has the additional advantage that if you want Perl-like parenthetic delimiters (#(Hello) instead of #(Hello( ), you can just modify the individual pattern to suit (as I've done below). [Note 1] Since all the actions are the same, it might be easier to use a macro for the action, which would also make it easier to modify.

      /* Ordinary punctuation */
     #:[^:]*:    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
     #,[^,]*,    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
     #;[^;]*;    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
      /* Matched pairs */
     #<[^>]*>    { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
     #\[[^]]*]   { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
      /* Trap errors */
     #           { /* Report unmatched or invalid delimiter error */ }

    If I were writing a script to generate these rules, I would use hexadecimal escapes for all the delimiter characters rather than trying to figure out which ones needed escapes.
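
As a rough sketch of such a generator, the C code below formats one rule per delimiter and writes every delimiter as a \xHH escape, so no quoting decisions are needed. The names format_rule, is_delimiter and print_rules are made up for illustration, and accepting every ASCII punctuation character as a delimiter is an assumption; the action text is copied from the rules in the third workaround. Matched pairs such as #<...> would still need to be written by hand.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: write the flex rule for delimiter `delim` into `buf`.
 * Returns what snprintf returns (the length the full rule needs). */
int format_rule(char *buf, size_t size, unsigned delim)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size,
        "#\\x%02x[^\\x%02x]*\\x%02x   "
        "{ yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }",
        delim, delim, delim);
}

/* Assumed policy: any ASCII punctuation character may act as a delimiter. */
int is_delimiter(int c)
{
    return c >= 0 && c < 128 && ispunct(c);
}

/* Print one rule per accepted delimiter, followed by the error trap. */
void print_rules(FILE *out)
{
    char line[160];
    for (int c = 0; c < 128; ++c) {
        if (!is_delimiter(c))
            continue;
        format_rule(line, sizeof line, (unsigned)c);
        fprintf(out, "%s\n", line);
    }
    fputs("#   { /* Report unmatched or invalid delimiter error */ }\n", out);
}
```

Calling print_rules(stdout) and pasting the output into the rules section is one way to use it; for ':' the generated pattern is #\x3a[^\x3a]*\x3a.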

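The input() loop in the first workaround assumes an expandable string builder. A minimal sketch of one in C might look like the following. The type and function names mirror the ones used in that snippet, but they are hypothetical and not part of (f)lex or of any standard library, and this version assumes the builder is passed by pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable character buffer; not part of (f)lex. */
typedef struct {
    char  *data;   /* heap storage, or NULL after an allocation failure */
    size_t len;    /* characters stored so far */
    size_t cap;    /* allocated size of data */
} StringBuilder;

StringBuilder string_builder_new(void)
{
    StringBuilder sb = { malloc(16), 0, 16 };
    if (sb.data == NULL)
        sb.cap = 0;
    return sb;
}

/* Append one character, doubling the buffer when it is full.
 * Returns 0 on success, -1 if memory could not be allocated. */
int string_builder_addchar(StringBuilder *sb, int ch)
{
    assert(sb != NULL);
    if (sb->data == NULL)
        return -1;
    if (sb->len + 1 >= sb->cap) {      /* keep room for the final NUL */
        size_t new_cap = sb->cap * 2;
        char *grown = realloc(sb->data, new_cap);
        if (grown == NULL)
            return -1;
        sb->data = grown;
        sb->cap = new_cap;
    }
    sb->data[sb->len++] = (char)ch;
    return 0;
}

/* NUL-terminate the text and hand it to the caller, who must free() it.
 * The builder is left empty afterwards. */
char *string_builder_release(StringBuilder *sb)
{
    assert(sb != NULL);
    char *result = sb->data;
    if (result != NULL)
        result[sb->len] = '\0';
    sb->data = NULL;
    sb->len = sb->cap = 0;
    return result;
}
```

Because the released string is heap-allocated, whoever consumes yylval (typically the parser action) is responsible for freeing it.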

Notes:

  1. Perl requires nested balanced parentheses in constructs like that. But you can't do that with regular expressions; if you wanted to reproduce Perl behaviour, you'd need to use some variation on one of the other suggestions. I'll try to revisit this answer later to address that feature.
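
To illustrate why a counter is needed for nesting, here is a minimal sketch in C of a scan that tracks depth; the same idea could replace the simple next == delim test in the input() loop of the first suggestion. The function balanced_length is hypothetical and works on an in-memory string rather than on the scanner's input. It is only an illustration of the idea, not a reproduction of Perl's exact rules (for instance, it ignores backslash escapes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: `s` points just past the opening delimiter `open`.
 * Returns the number of characters before the matching `close`,
 * or -1 if the string ends before the delimiters balance. */
long balanced_length(const char *s, char open, char close)
{
    assert(s != NULL && open != close);
    int depth = 1;                     /* the opener was already consumed */
    for (size_t i = 0; s[i] != '\0'; ++i) {
        if (s[i] == open) {
            ++depth;                   /* nested opener */
        } else if (s[i] == close && --depth == 0) {
            return (long)i;            /* this closer balances the first opener */
        }
    }
    return -1;                         /* unterminated */
}
```

For the input #(a(b)c), calling balanced_length on the text after #( returns 5, the length of a(b)c, whereas a pattern like [^)]* would stop at the first closing parenthesis.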

