
How to understand the flex C/C++ string literal regex?

I'm learning flex/bison to study parsing. The book flex & bison shows the following flex example:


UCN (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})

%{
...
%}

%%

\"([^\"\\]|\\['"?\\abfnrtv]|\\[0-7]{1,3}|\\[Xx][0-9a-fA-F]+|{UCN})+\" { ... save token here }

%%

I don't understand these parts of the regex:

  1. Does [^\"\\] mean "do not match \" or \\"? If so, why does this need to be specified, since \" and \\ do not seem to appear in the other groups?
  2. What does \\[0-7]{1,3} mean?
  3. What does \\[Xx][0-9a-fA-F] mean?
  4. Does UCN mean UTF-8?

That regular expression matches the following:

  • A " character,
  • Followed by any combination of one or more of the following:
    • [^\"\\] - Any character other than " or \
    • \\['"?\\abfnrtv] - A \ followed by any of ', ", ?, \, a, b, f, n, r, t, or v.
    • \\[0-7]{1,3} - A \ followed by one to three octal digits.
    • \\[Xx][0-9a-fA-F]+ - A \ followed by X or x followed by one or more hexadecimal digits.
    • {UCN}, which expands to (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}) - Either of the following:
      • \\u[0-9a-fA-F]{4} - A \ followed by u followed by four hexadecimal digits
      • \\U[0-9a-fA-F]{8} - A \ followed by U followed by eight hexadecimal digits
  • Followed by a closing " character
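
As a rough cross-check of the breakdown above, here is a minimal C sketch of a hand-written matcher that accepts the same strings as the pattern as written, including its quirks (the empty string is rejected and \X is accepted). It is only an illustration, not what flex generates. The names match_element and match_string_literal are made up for this example, and the input is assumed to be NUL-terminated.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if c is an octal digit. */
static int is_octal(char c) { return c >= '0' && c <= '7'; }

/* Count up to `max` hexadecimal digits starting at s. */
static size_t hex_run(const char *s, size_t max) {
    size_t n = 0;
    while (n < max && isxdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of one element of the string body at s, or 0 if no alternative
   matches. Each branch mirrors one alternative of the flex pattern. */
static size_t match_element(const char *s) {
    if (s[0] == '\0') return 0;                      /* end of input (assumption: NUL-terminated) */
    if (s[0] != '"' && s[0] != '\\') return 1;       /* [^\"\\] */
    if (s[0] != '\\') return 0;                      /* a bare " ends the body */
    if (s[1] != '\0' && strchr("'\"?\\abfnrtv", s[1]))
        return 2;                                    /* \\['"?\\abfnrtv] */
    if (is_octal(s[1])) {                            /* \\[0-7]{1,3} */
        size_t n = 1;
        while (n < 3 && is_octal(s[1 + n])) n++;
        return 1 + n;
    }
    if (s[1] == 'x' || s[1] == 'X') {                /* \\[Xx][0-9a-fA-F]+ */
        size_t n = hex_run(s + 2, (size_t)-1);
        return n ? 2 + n : 0;
    }
    if (s[1] == 'u') return hex_run(s + 2, 4) == 4 ? 6 : 0;   /* \\u[0-9a-fA-F]{4} */
    if (s[1] == 'U') return hex_run(s + 2, 8) == 8 ? 10 : 0;  /* \\U[0-9a-fA-F]{8} */
    return 0;                                        /* unknown escape */
}

/* Length of the string literal at the start of s, or 0 if there is no match. */
size_t match_string_literal(const char *s) {
    size_t pos = 1, count = 0, len;
    if (s[0] != '"') return 0;                       /* opening " */
    while ((len = match_element(s + pos)) > 0) {
        pos += len;
        count++;
    }
    /* The pattern uses +, so at least one element is required. */
    if (count == 0 || s[pos] != '"') return 0;       /* closing " */
    return pos + 1;
}
```

Called on the six characters "\123" (quotes included), match_string_literal returns 6. Called on the two characters "", it returns 0, because the + in the pattern demands at least one element.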

Note that this isn't actually a correct pattern for matching all C++ string literals, because:

  • It doesn't match the empty string (""), since the trailing + requires at least one element. Using * instead of + would allow it.
  • Hex escape codes must begin with a lower-case x, but the pattern also accepts X. A better pattern for matching those would be \\x[0-9a-fA-F]+

For more info about what all of the C++ escape sequences mean, see this page.

To answer your specific questions:

  1. A \ introduces an escape sequence, which is handled by the other alternatives, and an unescaped " marks the end of the string literal. The generic "any character" class excludes both of those characters so that they can only be matched by the other parts of the expression.
  2. Answered above: \\[0-7]{1,3} means a \ followed by one to three octal digits.
  3. Answered above: \\[Xx][0-9a-fA-F]+ means a \ followed by X or x followed by one or more hexadecimal digits.
  4. UCN is short for Universal Character Name. It denotes a Unicode character, but doesn't say anything about its encoding.
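
To illustrate the last point, the following C sketch separates the two steps: parse_ucn turns a \uXXXX or \UXXXXXXXX escape into a code point (just a number), and utf8_encode then picks one possible encoding for it. Both function names are hypothetical helpers written for this example; a scanner could just as well emit UTF-16 or UTF-32 from the same code point.

```c
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse \uXXXX or \UXXXXXXXX at s.
   Returns the code point, or -1 if the escape is malformed or out of range. */
long parse_ucn(const char *s) {
    unsigned long cp = 0;
    int digits, i;
    if (s[0] != '\\' || (s[1] != 'u' && s[1] != 'U')) return -1;
    digits = (s[1] == 'u') ? 4 : 8;
    for (i = 0; i < digits; i++) {
        int v = hex_value(s[2 + i]);
        if (v < 0) return -1;
        cp = cp * 16 + (unsigned long)v;
    }
    return cp <= 0x10FFFFUL ? (long)cp : -1;
}

/* Encode cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(unsigned long cp, unsigned char *out) {
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates are not valid code points to encode */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, \u00e9 parses to the code point 0xE9, which this sketch encodes as the two UTF-8 bytes 0xC3 0xA9. The escape itself carries only the number.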

  1. [^...] matches any single character that is not in the set ...

  2. \\[0-7]{1,3} matches a \ followed by one to three characters from the set 0-7 (the digits need not be the same; for instance, "\123" is matched)

  3. \\[Xx][0-9a-fA-F] matches a \ followed by either x or X, followed by one character from the set 0-9a-fA-F (in the full pattern, the trailing + allows one or more such characters)

  4. UCN is an example of a lex definition, that is, a named pattern. Such definitions allow a regex pattern to be reused later without copying the entire pattern; instead, the name is enclosed in curly braces: {UCN}
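
For items 2 and 3, it may help to see what such escapes stand for once they are matched. The C sketch below decodes an octal or hexadecimal escape into its numeric value. The function names are invented for this example, and the hex version accepts both x and X only to mirror the pattern; standard C and C++ define only the lower-case form.

```c
#include <stddef.h>

/* Decode \ooo at s (one to three octal digits).
   Stores the value in *value and returns the characters consumed, or 0 on no match. */
size_t decode_octal_escape(const char *s, unsigned *value) {
    size_t n = 0;
    unsigned v = 0;
    if (s[0] != '\\') return 0;
    while (n < 3 && s[1 + n] >= '0' && s[1 + n] <= '7') {
        v = v * 8 + (unsigned)(s[1 + n] - '0');
        n++;
    }
    if (n == 0) return 0;
    *value = v;
    return 1 + n;
}

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode \x or \X followed by one or more hex digits.
   Stores the value in *value and returns the characters consumed, or 0 on no match.
   Very long digit runs simply wrap around; a real scanner would report an error. */
size_t decode_hex_escape(const char *s, unsigned long *value) {
    size_t n = 0;
    unsigned long v = 0;
    if (s[0] != '\\' || (s[1] != 'x' && s[1] != 'X')) return 0;
    for (;;) {
        int d = hex_digit(s[2 + n]);
        if (d < 0) break;
        v = v * 16 + (unsigned long)d;
        n++;
    }
    if (n == 0) return 0;
    *value = v;
    return 2 + n;
}
```

With this sketch, \123 decodes to 83 (the ASCII code of S) and \x41 decodes to 65 (the ASCII code of A).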

If the first three are still confusing, I suggest finding introductory material on regular expressions. The flex manual covers definitions: https://www.cs.virginia.edu/~cr4bd/flex-manual/Definitions-Section.html#Definitions-Section
