I'm learning flex/bison for parsing. The book flex & bison shows this flex example:
UCN (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})
%{
...
%}
%%
\"([^\"\\]|\\['"?\\abfnrtv]|\\[0-7]{1,3}|\\[Xx][0-9a-fA-F]+|{UCN})+\" { ... save token here }
%%
I don't understand these parts of the regex:
[^\"\\] - does this mean "do not match \" or \\"? If so, why does that need to be specified, since \" and \\ don't seem to appear in the other groups?
\\[0-7]{1,3} - what does this mean?
\\[Xx][0-9a-fA-F] - what does this mean?
UCN - does this mean UTF-8?

That regular expression matches the following:
An opening " character, followed by one or more of the following:
[^\"\\] - Any character other than " or \
\\['"?\\abfnrtv] - A \ followed by any of ', ", ?, \, a, b, f, n, r, t, or v
\\[0-7]{1,3} - A \ followed by one to three octal digits
\\[Xx][0-9a-fA-F]+ - A \ followed by X or x followed by one or more hexadecimal digits
{UCN}, which expands to (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}) - Either of the following:
  \\u[0-9a-fA-F]{4} - A \ followed by u followed by four hexadecimal digits
  \\U[0-9a-fA-F]{8} - A \ followed by U followed by eight hexadecimal digits
A closing " character

Note that this isn't actually a correct pattern for matching all C++ string literals because it cannot match the empty string literal ("") and it accepts an uppercase X in hexadecimal escapes, whereas C++ only allows a lowercase x. A better pattern for matching those would be \\x[0-9a-fA-F]+
For more info about what all of the C++ escape sequences mean, see this page.
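
As a rough cross-check of the breakdown above, the same pattern can be transcribed into Python's re module and tried on a few inputs. This is only a sketch: the names STRING_LITERAL and find_string_literals are made up for illustration, {UCN} is expanded by hand, and flex picks the longest match while Python's re tries alternatives in order, so behaviour on odd inputs may differ from a real flex scanner.

```python
import re

# Hand transcription of the flex rule into Python's re syntax.
# Illustrative only; the names here are not part of flex or of the book.
STRING_LITERAL = re.compile(r'''
    "                           # opening quote
    (?:
        [^"\\]                  # any character except a quote or a backslash
      | \\['"?\\abfnrtv]        # simple escape sequence
      | \\[0-7]{1,3}            # octal escape, one to three digits
      | \\[Xx][0-9a-fA-F]+      # hexadecimal escape, as written in the rule
      | \\u[0-9a-fA-F]{4}       # UCN, short form
      | \\U[0-9a-fA-F]{8}       # UCN, long form
    )+                          # one or more, so an empty literal cannot match
    "                           # closing quote
''', re.VERBOSE)


def find_string_literals(source):
    """Return every substring of source that the pattern accepts."""
    return [m.group(0) for m in STRING_LITERAL.finditer(source)]


if __name__ == "__main__":
    # An unescaped quote ends the literal, so two literals are found.
    print(find_string_literals(r'"a" + "b"'))
    # An escaped quote is consumed by the escape alternative instead.
    print(find_string_literals(r'"say \"hi\"" rest'))
    # The rule as written cannot match an empty literal.
    print(find_string_literals('""'))
```

Because [^"\\] refuses both the quote and the backslash, the scan over "a" + "b" stops at the first closing quote instead of swallowing everything up to the last one, which is the reason for the exclusion.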
To answer your specific questions:
[^\"\\] - \ denotes an escape sequence, which is handled by the other options, and an un-escaped " means the end of the string literal. The generic "any character" match doesn't match either of those characters so that they can be matched by other parts of the expression.
\\[0-7]{1,3} - means a \ followed by one to three octal digits.
\\[Xx][0-9a-fA-F]+ - means a \ followed by X or x followed by one or more hexadecimal digits.
UCN - is short for Universal Character Name. It denotes a Unicode character, but doesn't say anything about its encoding.

[^...] means match any single character that is not one of ...
\\[0-7]{1,3} means to match a \ followed by one to three characters from the set 0-7 (the matched characters need not be the same; for instance "\123" is matched)
\\[Xx][0-9a-fA-F] means to match a \ followed by either an x or X followed by a character from the set 0-9a-fA-F
UCN is an example of a lex custom definition. Such definitions allow a regex pattern to be reused later without copying the entire pattern; instead, the name is enclosed in curly braces: {UCN}
I suggest you find material about regexes if the first three are really that confusing. The flex manual can tell you about definitions: https://www.cs.virginia.edu/~cr4bd/flex-manual/Definitions-Section.html#Definitions-Section
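
To make the {UCN} mechanism concrete, here is a toy Python model of how a named definition gets substituted into a rule. It is not flex's implementation: DEFINITIONS and expand_definitions are hypothetical names, and wrapping the substituted text in parentheses follows what the flex manual describes for definition expansion.

```python
import re

# Hypothetical miniature of a lex definitions section: name -> pattern text.
DEFINITIONS = {
    "UCN": r"(\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})",
}


def expand_definitions(pattern, definitions=DEFINITIONS):
    """Replace each {NAME} that names a known definition with (definition)."""
    def substitute(match):
        name = match.group(1)
        if name in definitions:
            # Parenthesise so a following operator applies to the whole thing.
            return "(" + definitions[name] + ")"
        return match.group(0)  # unknown name: leave the text untouched

    # Names start with a letter or underscore, so repetition counts such as
    # {4} or {1,3} are never mistaken for definition references.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_-]*)\}", substitute, pattern)


if __name__ == "__main__":
    rule = r"\\[0-7]{1,3}|{UCN}"
    expanded = expand_definitions(rule)
    print(expanded)
    # The expanded text happens to be valid Python re syntax as well.
    print(bool(re.fullmatch(expanded, r"\u00e9")))
```

The point is only that {UCN} is textual shorthand: after expansion the rule behaves as if the full alternation had been written inline.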