
Multi-line match inside literals in Flex

I am trying to match text inside %[ and ]% in single or multiple lines. First thing I tried was:

\%\[(.*?)\]\%              return MULTILINE_TEXT;

but this only works for single-line cases, not for multiple lines. So I thought I could use the /s flag:

/\%\[(.*?)\]\%/s           return MULTILINE_TEXT;

But flex sees this as an invalid rule. The last thing I tried was:

\%\[((.*?|\n)*?)\]\%       return MULTILINE_TEXT;

which seemed to work, but it doesn't stop at the first ]%. In the following example:

%[ Some text ...
   Some text ... ]%

... other stuff ...

%[ Some more text ...
   Some more text ... ]%

flex will return the entire thing as a single token. What can I do?

Note that *? is not treated as a non-greedy match by flex.

Flex does support some regex flags, but its syntax differs a little from that of most regex libraries. For example, you can change the meaning of . by setting the s flag; the change applies only to the region within the parentheses (not to everything following the flag setting, as in PCRE):

"%["(?s:.*)"%]"

It's more common to see the lex-compatible usage:

"%["(.|\n)*"%]"

You can also use the x flag for slightly more readable regexes:

(?xs: "%[" .* "%]" )

(The x flag does not work in definitions, only in pattern rules.)

Quoted strings (as above) are another (f)lex-specific syntax, which can be more readable than backslash escapes, although backslash escapes also work. But flex does not implement PCRE/GNU/JS extensions such as \w and \s.

See the flex manual for a complete guide to flex regexes; it's definitely worth reading if you are used to other regex syntaxes.

You will probably find it disappointing that (f)lex does not support many common regex extensions, including non-greedy matches. That makes it awkward to write patterns for tokens terminated by a multi-character delimiter, as in your example. If the delimiters %[ and %] cannot be nested, so that you really want the match to end with the first %], you could use something like this:

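%\[([^%]|%+[^]%])*%+\]   or  (?x: "%[" ( [^%] | %+ [^]%] )* %* "%]" )

That's a bit hard to read, but it is precise: %[ followed by any number of repetitions of either a character other than % or a run of % followed by a character other than % or ], ending with a run of % followed by a ]. (The class has to exclude % as well as ]; otherwise a run such as %%] could be consumed as %% followed by ], and the longest match would continue past the terminator.)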

In the above pattern, you need %+ rather than % to deal with strings like:

%[%% text surrounded by percents%%%]
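
One way to sanity-check a pattern like this without running flex is to emulate flex's longest-match rule in Python. This is only a sketch under assumptions: Python's re module is a backtracking engine rather than flex, the helper name longest_match is made up for illustration, and the approach works here only because this particular pattern uses syntax that both tools read the same way. The inner class is written as [^]%].

```python
import re

# The lex-compatible "anything, including newlines" pattern.
GREEDY = r"%\[(.|\n)*%\]"
# The precise pattern; the inner class excludes % too, so "%%]" always ends the token.
PRECISE = r"%\[([^%]|%+[^]%])*%+\]"


def longest_match(pattern, text):
    """Return the longest prefix of text that the whole pattern matches, or None.

    Trying every prefix from longest to shortest imitates flex's
    longest-match rule for a single pattern.
    """
    compiled = re.compile(pattern)
    for end in range(len(text), 0, -1):
        if compiled.fullmatch(text, 0, end):
            return text[:end]
    return None


if __name__ == "__main__":
    sample = "%[ a\n b %]\nother\n%[ c %]\n"
    print(repr(longest_match(GREEDY, sample)))
    print(repr(longest_match(PRECISE, sample)))
    print(repr(longest_match(PRECISE, "%[%% text surrounded by percents%%%]")))
```

On the sample, the greedy pattern runs through to the last %] while the precise pattern stops at the first one, and the string with runs of percent signs is matched as a single token.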

A more readable solution which also allows for nested %[ is to use start conditions. There's a complete example of a very similar solution in this answer.
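
The start-condition idea amounts to a small state machine: one state outside a block and one inside it, with the opener and closer switching between them. In flex that would be an exclusive start condition (declared with %x and switched with BEGIN); the sketch below mimics only the control flow in Python and is not the linked answer's code. The name scan_blocks and the depth counter used for nesting are assumptions made for illustration.

```python
def scan_blocks(text):
    """Yield the body of each top-level %[ ... %] block, allowing nested blocks."""
    depth = 0   # 0 plays the role of INITIAL; > 0 means "inside a block"
    start = 0   # index where the current top-level body begins
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "%[":
            if depth == 0:
                start = i + 2          # body starts right after the opener
            depth += 1
            i += 2
        elif pair == "%]" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i]    # back in the outer state: emit the token
            i += 2
        else:
            i += 1                     # any other character, inside or outside
    if depth > 0:
        raise ValueError("unterminated %[ block")


if __name__ == "__main__":
    for body in scan_blocks("%[ outer %[ inner %] tail %] rest %[ second %]"):
        print(repr(body))
```

Because each state has only short, simple rules, there is no need for one regex that describes the whole token, and an unterminated block can be reported explicitly instead of silently failing to match.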
